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ABSTRACT 


The  problem  of  Adaptive  Tine  Series  Analysis  and  System 
Identification  is  a  very  difficult  one, particularly  in  an 
environment  where  the  system  characteristics  are  changing 
with  time  or  the  system  is  in  a  closed-loop  configuration, 
such  as  the  problem  of  Adaptive  Flutter  Suppression  and 
Control  of  Large  Space  Structures.  The  main  objective  of 
this  project  was  to  investigate  the  theoretical  issues 
related  to  a  relatively  new  system  identification  technique, 
known  as  Canonical  Variate  Analysis  (CVA) ,  that  has  been 
successfully  used  in  such  environments.  In  this  technique, 
a  Markov  Model  is  extracted  from  an  observed  data  set  on  the 
basis  of  stochastic  realization  theory  and  statistical 
correlation  analysis.  The  optimal  model  order  is 
automatically  selected  using  an  information  theoretic 
criterion  known  as  Akaike  Information  Criteria  (AIC) .  The 
overall  technique  is  known  as  CVA-AIC  System  Identification 
Technique. 

The  CVA-AIC  technique  has  been  in  use  for  some  time 
primarily  as  an  ad-hoc  scheme.  Although  the  underlying 
theories  such  as  the  Stochastic  Realization  Theory  and 
Canonical  Correlation  Analysis  are  rigorously  established, 
various  aspect  of  CVA-AIC  technique  itself  have  not  been 
based  on  rigorous  theoretical  justifications.  For  this 
reason,  the  overall  effort  of  this  project  was  devoted 


■towards  theoretical  understanding  and  relationship  to  other 
identification  techniques.  In  particular,  the  relationship 
of  the  CVA-AIC  Technique  to  Maximum  Likelihood 
Identification  using  the  E-M  algorithmic  approach  has  been 
investigated.  It  has  been  shown  how  Maximum  Likelihood 
Estimates  can  be  obtained  starting  from  the  CVA-AIC 
solution.  The  E-M  Algorithmic  approach  also  suggests  real¬ 
time  recursive  algorithms. 

The  report  starts  with  a  clear  and  theoretically 
elegant  interpretation  of  AIC  in  a  discrete  valued  random 
variable  framework.  The  relationship  between  the  state- 
space  model  obtained  by  the  CVA-AIC  technique  and  the 
standard  Kalman  Filter  form  has  been  explored.  In  many 
situations,  models  of  all  orders  are  needed  and,  therefore, 
a  simple  algorithm  that  is  recursive  in  the  model  order  has 
been  developed  without  matrix  inversion  at  each  step.  The 
control  engineers  often  rely  on  the  confidence  bands  around 
the  power  spectral  density  function  of  noise  and  transfer 
function  of  the  system.  It  has  been  shown  how  to  compute 
this  band  at  various  confidence  levels  around  the  true 
spectral  density  and  transfer  function.  A  major  emphasis  has 
been  placed  upon  time  varying  dynamic  systems  with  slowly 
varying  and  abrupt  changes.  It  has  been  shown  that  by 
selecting  moving  windows,  the  slowly  time  varying  parameters 
in  the  system  can  be  tracked  and,  by  appropriately 
partitioning  the  data,  the  computed  AIC  from  each  partition 


CHAPTER  1 


1.  INTRODUCTION 

1 . 1  Overview  of  the  Adaptive  Tine  Series  Analysis  Problem 

Adaptation  in  time  series  is  an  important  problem  in  a  number  of  DOD 
systems  and  has  many  applications  in  various  commercial  industries.  This 
is  an  especially  difficult  problem  in  problems  requiring  realtime  adap¬ 
tation  to  process  changes  since  such  a  procedure  would  have  to  be  comple¬ 
tely  automatic  and  reliable.  Adaptation  is  necessary  in  systems  where  the 
dynamical  characteristics  change  with  time  in  unpredictable  ways,  or  where 
the  noise  disturbance  process  characteristics  vary  with  time.  Examples  of 
systems  that  require  adaptive  time  series  analysis  are  the  adaptive 
suppression  of  aircraft  wing  flutter,  identification  of  the  dynamics  of 
large  flexible  space  structures,  detection  of  failures  in  aircraft  from 
subsystem  failures  of  battle  damage,  identification  of  missile  aerodyna¬ 
mics,  target  tracking,  and  various  signal  processing  problems. 

The  solution  to  the  adaptive  time  series  analysis  requires  several 
advances  in  current  time  series  methods.  At  the  core  of  the  problem  is  the 
need  for  a  fundamental  statistical  approach  to  the  adaptation  problem  that 
poses  the  problem  in  a  meaningful  way  and  that  leads  to  computable  solu¬ 
tions.  To  solve  the  online  adaptation  problem,  a  reliable  and  automatic 
time  series  modeling  procedure  is  required  that  is  lacking  in  previous 
methods.  The  current  research  provides 

•  A  sound  statistical  basis  for  posing  and  solving  the  adaptation 
problem 
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•  A  numerically  and  statistically  reliable  online  computational 


procedure 

This  approach  has  been  used  in  conjunction  with  a  new  high  resolution 
system  identification  method  utilizing  canonical  variate  analysis  (CVA)  for 
the  determination  of  the  dynamics  of  high  order  multisensor  systems  with  a 
small  data  length  (Larimore,  1983b).  This  algorithm  can  be  implemented  on 
highly  parallel  processors  such  as  a  systolic  array.  This  makes  practical 
the  consideration  of  many  different  system  characteristics  to  determine  the 
best  for  modeling  the  observed  sensor  data  and  correlational  relationships 
between  the  many  sensors.  The  system  characteristics  that  have  been  suc¬ 
cessfully  determined  adaptively  are  the  dynamical  state  order  of  the 
system,  the  presence  of  correlated  disturbances,  the  optimal  data  length  to 
use  in  tracking  a  time  varying  system,  and  the  optimal  data  interval  for 
detection  of  an  abrupt  change  or  other  event  in  the  data. 

The  CVA  time  series  analysis  method  has  been  applied  to  the  design  of 
an  adaptive  flutter  suppression  problem  for  suppressing  wing  flutter  or 
aero-structural  vibration  in  aircraft.  While  considerable  progress  has 
been  made  in  the  problem  of  adaptation  in  terms  of  identification  of  time 
series  models,  adaptive  time  series  methods  which  can  efficiently  track  and 
detect  time  varying  processes  would  further  improve  the  system.  In  such  a 
system  the  wing  dynamical  characteristics  can  change  instantaneously  when  a 
wing  store  is  dropped,  and  the  new  wing  dynamics  are  unknown  and  may  be 
unstable  resulting  In  a  growing  oscillation.  If  the  unstable  mode  is  not 
detected,  accurately  identified,  and  stabilized  by  control  feedback  in  less 
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chan  a  second,  then  Che  aircraft  can  lose  a  wing.  The  CV A  algorithm  using 
entropy  methods  for  deciding  model  state  order  are  being  implemented  on  a 
vector  array  processor  which  will  identify  high  order  systems  with  dozens 


of  dynamical  states  and  multiple  inputs  and  outputs  in  fractions  of  a 
second.  This  system  has  been  tested  in  real-time  simulations,  and  was  suc¬ 
cessfully  demonstrated  in  wind  tunnel  tests  at  the  NASA  Langley  Transonic 
Dynamics  Wind  Tunnel.  It  is  expected  that  highly  parallel  processors  such 
as  systolic  array  processors  could  result  in  a  speedup  of  many  thousands  of 
times  which  would  be  required  for  some  very  large  scale  real  time  adaptive 
problems. 

1 • 2  Signal  and  Fault  Detection 


A  Comprehensive  survey  of  fault  detection  methods  is  given  by  Willsky 
(1976).  See  also  Mehra  and  Peschon  (1971),  Willsky  and  Jones  (1974), 
Willsky  (1980),  and  Isermann  (1984).  The  type  of  abrupt  changes  in  a 
system  that  are  considered  are  of  the  form 

x(t+l)  =  *x(t)  +  Gu(t)  +  w(t)  +  m(t)  (1.1) 

y(t)  =  Hx(t)  +  Au( t )  +  Bw(t)  +  v(t)  +  N(t)  (1.2) 

where  u  is  the  input  vector  process,  y  is  the  output  vector,  x  is  the  state 
vector,  and  w  and  v  are  white  noise  processes  that  are  independent  with 
covariance  matrices  Q  and  R  respectively.  These  white  noise  processes 
model  the  covariance  structure  of  the  error  in  predicting  y  from  u.  The 
abrupt  changes  are  in  the  form  of  the  time  the  functions  m(t)  and  n(t) 
introduced  into  the  state  and  observation  equations.  Fault  detection  is 


thus  the  detection  of  the  presence  of  such  nonzero  functions. 


For  various  hypothesized  forms  of  the  functions,  i.e.  ,  for  jumps  in 
various  components  or  specific  combinations  of  the  components,  a  particular 
detection  computation  is  devised  which  requires  implementation  of  a  Kalman 
filter.  This  leads  to  statistically  most  powerful  likelihood  ratio  tests 
of  the  various  failure  hypotheses.  An  optimal  solution  to  the  failure 
detection  problem  formulated  in  (1.1)  and  (1.2)  is  thus  obtained. 

There  are  however  several  more  general  failure  detection  problems  not 
of  the  form  of  (1.1)  and  (1.2).  The  approach  permits  only  the  consideration  of 
simple  hypotheses,  i.e.,  where  the  failure  functions  m(t)  and  n(t)  are  of 
the  form  of  an  unknown  scalar  amplitude  parameter  multiplying  a  function  of 
known  form.  More  general  functional  forms  such  as  two  components  with  dif¬ 
ferent  unknown  amplitude  parameters  multiplying  the  known  functions 
requires  maximum  likelihood  parameter  identification  at  considerable  com¬ 
putational  expense  and  loss  of  numerical  reliability.  Furthermore,  the 
problem  of  unknown  failure  time  leads  to  a  considerable  increase  in  the 
required  computation,  and  no  theoretically  sound  decision  procedure  has 
been  proposed  for  choosing  the  failure  time. 

The  general  case  of  changes  in  the  system  dynamics  or  correlational 
characteristics  of  the  disturbance  or  measurement  noise  processes  cannot  be 
handled.  Such  cases  require  general  time  series  analysis  parameter  iden¬ 
tification  methods  which  are  not  reliable  for  online  application  to  high 
state  order  multivariable  systems  as  discussed  in  Section  Multisensor 
System  Identification.  Isermann  (1984)  gives  a  survey  of  current  fault 
detection  methods  and  concludes  that:  "A  unique  calculation  of  the  process 
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coefficients  and  a  parameter  estimation  with  high  precision  is  only 
possible  for  low  order  elements  between  measured  variables.  Therefore  the 
measured  variables  should  be  selected  such  that  the  process  is  divided  in 
first  order  elements  or,  in  other  words,  all  state  variables  should  be 
measurable.  Easy  to  implement  parameter  estimation  methods  for  continuous¬ 
time  modles  to  be  used  on-line,  real-time  and  in  closed  loop  need  to  be 
developed. "  The  requirement  of  measuring  all  of  the  states  is  not 
realistic  in  most  situations  especially  in  general  multivariate  time  series 
and  system  identification  problems.  Fortunately,  the  CVA  system  iden¬ 
tification  method  does  not  require  this,  but  indeed  is  an  online,  real-time 
method  that  gives  the  same  accuracy  in  either  open  or  closed  loop. 

The  issues  of  adaptation  are  not  addressed  in  the  fault  detection 
literaure  except  in  simplistic  ways.  The  present  state  of  the  art  in  adap¬ 
tation  for  failure  detection  appears  to  be  the  work  of  Hagglund  (1983) 
discussed  in  the  next  section,  and  is  just  beginning  of  adaptive  approaches 
which  consider  fundamental  issues  in  adaptation. 

1.3  Adaptation  to  Changing  Processes 

Concepts  of  adaptive  systems  have  been  around  since  the  1950's 
involving  various  senses  of  adaptation.  The  present  literature  on  the  sub¬ 
ject  includes  a  number  of  methods  such  as  recursive  computational  schemes, 
exponential  forgetting,  lattice  computational  methods,  etc.,  which  have 
certain  "knobs”  that  allow  tuning  of  the  algorithm  to  accommodate  changes 
in  the  characteristics  of  the  actual  processes.  Reviews  of  these  and 
related  methods  are  contained  in  several  recent  special  issues  of  technical 
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journals  and  books  (Special  Issue  on  Adaptive  Control,  Automatica,  Vol.  20, 
No.  5,  1985;  Special  Issue  on  Linear  Adaptive  Filtering,  IEEE  Trans,  on 
Information  Theory,  Vol.  30,  No  2,  1984;  Honing  and  Messerschmitt ,  1984). 
t^hile  these  methods  do  permit  some  degree  of  adaptation  to  process  changes, 
the  methods  of  adaptation  are  ad  hoc,  and  no  sound  underlying  statistical 
principle  for  adaptation  is  proposed  or  demonstrated.  As  might  be 
expected,  these  methods  can  work  poorly  on  certain  cases  because  of  the 
lack  of  a  sound  statistical  basis. 

In  particular,  the  recursive  prediction  error  and  lattice  methods  are 
convenient  due  to  their  recursive  form  and  provide  an  estimate  at  every 
observation  (Friedlander,  1982a,  1982b,  1983;  Ljung  and  Soderstrom,  1983). 
Also,  the  recursive  algorithms  can  be  sued  for  adaptation  by  exponential 
weighting  of  the  past  data  (Wellstead  and  Sanoff,  1981;  Irving,  1979;  Evans 
and  Betz,  1982).  But  the  rational  for  exponential  weighting  has  not  been 
given  a  sound  fundamental  justification,  but  is  used  largely  due  to  its 
ease  of  use.  the  choice  of  the  exponential  weight  has  been  ad  hoc  and 
susceptible  to  misinterpretation  of  changing  noise  variance  levels  as  time 
varying  changes  in  the  dynamics  (Hagglund,  1983). 

The  fundamental  problem  in  adaptive  time  series  analysis  is  adaptation 
to  time  varying  processes.  The  essential  problem  is  the  determination  of 
the  characteristics  describing  the  rate  at  which  the  process  is  changing. 
This  problem  has  received  very  little  in-depth  treatment  in  the  literature. 
Most  of  the  difficulty  can  be  attributed  to  the  discrepancy  between  the 
true  and  assumed  uncertainty  in  the  measurements.  Adaptive  control  schemes 
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are  notoriously  optimistic  about  the  quality  of  the  parameter  estimates 
because  the  time  varying  nature  of  the  process  is  ignored. 

A  notable  exception  is  the  recent  work  of  Hagglund  (1983)  which  takes 
an  information  handling  point-of-view.  This  approach  leads  to  a  more 
realistic  appraisal  of  the  accuracy  of  the  parameter  estimates  and  con¬ 
sequently  the  value  of  new  measurements  which  become  available  in  time. 

Two  classes  of  time  varying  systems  are  considered: 

•  Processes  with  abrupt  changes 

®  Processes  with  slowly  varying  changes. 

Within  each  of  these  classes,  changes  are  considered  in  the  process  dyna¬ 
mics  and/or  noise  variance. 

For  abrupt  changes,  the  fault  detection  approach  is  taken.  The  central 
idea  is  to  monitor  differential  changes  in  the  parameter  estimates  to 
detect  abrupt  changes.  A  new  procedure  is  derived  by  Hagglund  which 
requires  no  apriori  information  and  is  very  sensitive  to  jumps  in  the  para¬ 
meters.  This  procedure  is  shown  to  have  very  good  properties  in  both 
theory  and  practice.  This  works  well  for  parameters  of  the  dynamics  as 
well  as  those  of  the  noise  variances  in  the  simple  cases  of  low  order 
systems. 

The  problem  of  slowly  varying  parameters  has  plagued  many  adaptive 
control  schemes.  Although  the  concept  of  discounting  the  old  data  using  a 
forgetting  factor  has  been  in  use  for  a  long  time,  the  problem  of  how  to 
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relate  this  factor  to  the  data  has  been  elusive.  The  principal  proposed  by 
Hagglund  is  to  discount  past  data  in  such  a  way  that  a  constant  amount  of 
information  would  be  retained  if  the  parameters  were  constant.  The  quan¬ 
titative  measure  of  the  information  used  is  the  inverse  of  the  parameter 
estimation  error  covariance  matrix  which  is  the  Fisher  information  matrix. 
Theory  and  simulations  show  that  this  works  quite  well  in  low  order  and 
well  conditioned  systems.  However  for  high  order  and  multisensor  systems 
with  illconditioned  parametric  structure,  the  algorithms  are  not  so  well 
behaved. 

1 . 4  Multisensor  System  Identification 

System  parameter  identification  from  observed  measurements  is  a  crucial 
part  of  the  adaptive  multivariate  timeseries  analysis  problem.  It  is 
necessary  to  adapt  anot  only  to  changes  in  the  input  to  output  charac¬ 
teristics  of  a  system,  but  the  correlational  characteristics  of  the  distur¬ 
bance  and  noise  processes  must  simultaneously  be  determined.  The 
feasibility  of  adaptive  methods  requires  first  that  a  reliable  online 
multivariate  time  series  identification  procedure  be  available. 

There  are  several  difficulties  with  currently  available  methods  and 
software  for  the  identification  of  system  dynamics  and  noise  charac¬ 
teristics.  Current  methods  include  the  self  tuning  regulator  (STR)  (Ljung, 
1983;  Astrom,  1973;  Astrom  et  al,  1973,  1977),  maximum  likelihood  estima¬ 
tion  (MLE)  (Mehra  and  Tyler,  1973;  Larimore,  1981a),  Box-Jenkins  (BJ) 
methods  (Box  and  Jenkins,  1976),  and  a  variety  of  heuristic  approaches. 

The  current  state  of  the  art  In  both  MLE  and  BJ  require  that  an  analyst  be 
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involved  in  the  procedure,  and  the  required  number  of  computational  itera¬ 
tions  is  not  bounded.  The  STR  has  been  applied  successfully  to  simple  pro¬ 
cesses,  but  is  not  completely  reliable  for  general  processes  particularly 
when  multi-input,  multi-ouput  systems  are  involved.  In  addition,  the 
recursive  prediction  error  algorithm  used  in  the  STR  requires  a  good  ini¬ 
tial  estimate  and  so  is  not  suitable  for  short  data  where  no  apriori  data 
is  available.  The  heuristic  approaches  tend  to  be  special  purposes  and  are 
rather  unreliable  in  general  applications. 

Of  the  current  approaches  to  multivariate  time  series  identification 
which  are  high  resolution,  i. e.  ,  make  efficient  use  of  the  observational 
information,  most  use  the  ARMA  (autoregressive  moving  average)  represen¬ 
tation  for  the  process.  For  multi-input  multi-output  systems  this  is  not  a 
globally  well  defined  parameterization  which  is  a  major  cause  of  the  dif¬ 
ficulties  in  the  present  identification  methods  (Gevers  and  Wertz,  1982).  , 

consequence  is  that  there  is  no  single  parameterization  which  is  numeri¬ 
cally  well  conditioned,  and  known  algorithms  can  be  made  to  fail  for  a  par¬ 
ticular  choice  of  system.  The  system  identification  problem  is  well 
defined  in  that  the  class  of  models  does  have  best  models  in  a  maximum 
likelihood  sense  (Larimore,  1981a),  but  the  ARMA  parameterization  is  not 
unique  so  that  for  cases  such  as  pole-zero  cancellation  there  is  a  whole 
equivalence  class  of  models  with  equivalent  characteristics.  In  the  sequel 
this  difficulty  in  parameterization  will  be  resolved  by  the  use  of  state 
space  models,  and  stable  numerical  methods  will  be  described  for  statisti¬ 
cally  reliable  online  identification  of  multivariable  time  series. 


I 

1 . 5  Adaptive  Time  Series  Analysis  Using  Predictive  Inference  and  Entropy 

Recently  a  very  general  predictive  inference  approach  to  statistical 
modeling  has  led  to  a  fundamental  statistical  inference  justication  of 
negative  entropy  as  the  natural  measure  of  model  approximation  error 
(Larimore,  1983a).  This  development  has  a  number  of  very  attractive 
features : 

•  It  applies  to  completely  general  modeling  problems  including 
nonparametric  methods. 

•  It  applies  exactly  to  small  samples. 

•  Only  the  fundamental  statistical  principlaes  of  sufficiency  and 
repeated  sampling  are  used. 

•  It  applies  to  time  correlated  problems  such  as  time  series  model 
identification  and  tracking. 

•  Statistical  inference  can  be  fundamentally  viewed  as  model 
approximation. 


Early  developments  in  predictive  distributions  are  very  old,  although 
modern  approaches  apparently  begin  with  Jeffreys  (1961,  p. 143)  who  used  a 
Bayesian  approach,  as  has  much  of  the  work  following  (Atchison  and 
Dunsmore,  1975,  preface  and  p.  39).  The  approach  taken  here  has  been  sti 
mulated  by  Murray  (1977,  1979),  the  work  of  Akaike  (1973)  and  model  struc 
ture  determination  problems  (Larimore,  1977a). 


SSI  has  been  in  the  forefront  in  developing  the  CVA  and  entropy 
metnods.  Here  the  related  projects  are  discussed  along  with  preliminary 
results  indicating  the  feasibility  of  the  proposed  methods. 

The  original  stochastic  realization  method  of  Akaike's  (1975)  was 
further  developed  into  a  commercial  software  package  for  mainframe  and  mini 
computers  by  Mehra  (1978)  and  Mehra  and  Cameron  (1976,  1980).  Further 
generalizations  to  input  output  systems  along  with  refinements  in  com¬ 
putational  speed  and  accuracy  were  developed  by  larimore  (1983b)  and 
Goodrich  and  Larimore  (1983)  leading  to  the  current  timeseries  analysis  and 
forecasting  package,  Forecast  Master  (Trademark  of  SSI),  for  the  IBM/PC. 
This  package  is  in  widespread  use  in  utilities,  banks  companies  and 
universities. 

This  algorithm  has  been  the  basis  for  several  studies  in  online  systems 
identification.  The  project  "Basic  Research  in  Adaptive  Model  Algorithmic 
Control"  used  the  online  CVA  system  identification  algorithm.  In  the 
current  study  "Reconfiguration  Control  Strategies”,  the  CVA  method  along 
with  adaptive  tracking  and  detection  methods  are  being  studied.  The  pre¬ 
sent  theory  on  adaptation  using  entropy  methods  (Larimore,  1985a)  was  deve¬ 
loped  under  the  basic  research  study  "traget  Dynamic  Modeling"  and  under 
the  study  "Development  of  Statistical  Methods  Using  Predictive  Inference 
and  Entropy"  which  was  Phase  I  of  this  proposed  Phase  II  study. 

A  review  of  the  technology  in  system  identification  and  adaptive 
control  for  adaptive  methods  applicable  to  the  suppression  of  aeroelastic 
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wing  vibration  (flutter)  was  done  in  Larimore  and  Mehra  (1984).  This  study 
describes  the  deficiencies  of  current  methods  and  suggets  the  feasibility 
of  CV A  and  entropy  methods  for  fully  adaptive  online  detection  and  tracking 
of  wing  flutter.  In  a  current  study  with  General  Dynamics  sponsored  by  the 
Air  Force  Wright  Aeronautical  Laboratories,  CVA  has  been  analyzed  exten¬ 
sively  in  computer  simulations,  real  time  tests,  and  demonstrated  wind  tun¬ 
nel  teste  for  adaptive  flutter  suppression.  The  ability  of  CVA  to  identify 
very  complex  flutter  dynamics  of  high  state  order  involving  very  closely 
spaced  spectral  peaks  in  the  presence  of  correlated  wind  gust  disturbances 
using  short  data  lenghts  demonstrated  the  consderable  statistical  accuracy 
of  the  method.  The  online  CVA  identification  algorithm  was  demonstrated 
ina  wind  tunnel  test  at  the  NASA  Langley  Transonic  Dynamics  Wind  Tunnel  on 
a  1/4  scale  model  of  an  F-16  aircraft. 

1 . 7  Synopsis  of  Report 

In  Section  2,  we  present  a  detailed  and  transparent  derivation  of  an 
unbiased  entropy  measure  which  will  be  used  in  the  sequel  for  adaptive 
estimation.  This  measure  is  asymptotically  equal  to  Akaike's  AIC  cri¬ 
terion.  In  Section  3,  we  present  a  detailed  description  and  derivation  of 
linear  least-squares  prediction  using  canonical  variates  analysis  (CVA). 
Several  new  forms  for  these  predictors  are  given.  In  Section  4,  a  method 
for  direct  determination  of  the  parameters  of  the  Kalman  filter  in  canoni¬ 
cal  form  is  given,  and  is  shown  to  be  equivalent  to  a  truncated  optimal 
linear  predictor  derived  using  CVA.  Section  5  considers  the  model  order 
selection  problem,  using  an  entropy-based  approach.  The  problem  of  abrupt 
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change  detection  using  entropy  methods  is  considered  in  Section  6  and  a 
specific  algorithm  is  derived  and  tested.  In  Section  7  we  consider  the 
problem  of  slow  change  detection,  specifically  the  problem  of  finding  the 
optimal  data  length  for  model  fitting  when  the  time  series  coefficients  are 
slowly  varying.  An  entropy-based  algorithm  is  developed  and  tested. 


CHAPTER  ? 


I 

I 


2.  PREDICTIVE  INFERENCE  AND  ENTROPY 


2. I  Introduction 

In  this  section  we  develop  the  necessary  background  for  development  of 
adaptive  estimation  algorithms  in  the  sequel. 

The  problem  under  consideration  is  that  of  predicting  the  future  evolu¬ 
tion  of  a  time  series,  given  some  observations  of  the  past.  The  predictive 
inference  framework  may  be  described  as  follows. 


We  assume  that  the  density  function  of  interest  is  parametrized  by  a 
parameter  vector  0  £  Rm  and  is  denoted  by  p(x  j  0).  For  the  purposes  of 
discrimination  between  two  alternatives  0^  and  0o  it  can  be  shown  (Akaike, 
1973)  that  all  necessary  information  is  contained  in  the  likelihood  ratio 


L(x) 


p(x 

01) 

p(x 

®o> 

(2.1) 


Thus,  the  mean  amount  of  information  for  discrimination  when  p(x  0q) 


is  the  true  density  is  of  the  form 


dx  (2.2) 

where  <j>(.)  is  a  properly  chosen  function.  It  can  be  argued  using  infor¬ 
mation  theoretic  arguments  (Akaike,  1973)  that  the  only  appropriate  form  is 

$(y)  =  log  y  (2.3) 


I(01 »  Q0 )  ■  /  P(x  |  Qq)  ♦ 
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which  leads  directly  to  the  measure 


B(©i,  ©q)  =  I  p(x  !  ©g)  loS 


dx 


(2.4) 


p(x  |  ©i) 
p(x)Q0 


Note  that  -  B(©i,  0g)  is  Che  Kullback-Liebler  information  for  discrimi¬ 
nation  in  favor  of  ©g.  It  can  be  easily  shown  that  B(0^,  Qg)  <  0  and 
equality  holds  if  and  only  if  p(x  |  ©j)  =  p(x  |  ©g)  almost  everywhere 
(Aitchison  and  Dunsmore,  1975). 

Note  that  B(©i,  ©g)  can  be  written  as 

S(© i ,  ©g)  =  /  p(x  j  Qg)  log  p(x  |  ©i)  dx 

~  /  P(x  |  ©o)  loS  P(x  |  0O)  dx  (2.5) 

Since  ©g  represents  the  true  (unknown)  parameter,  our  objective  is  to  find 

a 

the  parameter  estimate  ©  which  maximize  B(0,  Qg).  From  (2.5),  we  need 
only  maximize 


/  p(x  |  Qg)  log  p(x  |  ©)  dx 

A 

with  respect  to  ©  to  produce  our  estimate.  This  estimate  maximizes  the 
expected  log-likelihood  and  is  thus  a  maximum  -  likelihood  estimate. 

2.2  Preliminaries 

In  order  to  present  a  clear  development,  we  will  work  in  a  partitioned 
sample  space.  The  random  variable  x  is  presumed  to  be  in  n  -  dimensional 
Euclidean  space,  x  e  Rn,  and  Rn  is  partitioned  into  s  mutually  disjoint 
regions  &2»  •  •  •  >  ^s  which  cover  Rn: 
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nL  u  n2  u  .  .  .  u  ns  =  Rn 

«j  a  flj  =  0  ;  i*j 
We  then  define 

Pi (9)  =  /  p(x  |  ©)  dx  (2.6) 

ni 

i  =  1 ,  2 ,  •  .  .  ,  s 

We  consider  two  different  samples,  an  informative  sample  q  and  a  pre¬ 
dictive  sample  r.  The  informative  sample  is 

*q  =  (xqi*  Xq2  •  •  •  •  »  Xqnq} 

which  consists  of  nq  observations  of  x.  The  predictive  sample  is 
xr  =  {xrx,  xr2,  •  .  .  ,  x^^,} 

consists  of  nr  observations  of  x.  We  assume  that  nqi  of  the  informative 
samples  fall  into  8^  and  that  nri  of  the  predictive  sample  fall  into 
Then 


s 


(2.7) 

s 

l  nri  =  nr 
i=»l 


The  two  samples  Xq  and  xr  are  from  the  true  distribution. 


Thus  we  have,  approximately,  for  sufficiently  large  samples. 


Pqi(0O)  3  (2.8) 

nq 

and 

Pri<0O>  3  (2-9) 

“r 

and  we  assume  regularity  conditions  throughout  such  that 

Pq(x  j  ®o)  =  lim  Pqi(e0) 
nq  ♦  - 
s  ■+■  00 

where 

lim  =  x 

s  +  00 
x  e  % 

and  similarly  for  pr(x  |  0q).  The  computation  of  the  probabilities  asso¬ 
ciated  with  the  parametrized  densities  is  different.  Here  we  use  the  defi¬ 
nition  (2.6)  and  note  that  Pi(Q)  is  computable  from  p(x  |  0)  and  knowledge 
of  In  practice,  this  computation  need  not  be  done,  as  become  clear  in 

the  sequel. 

2.3  Entropy  and  Maximum  Likelihood  Estimation 

The  first  step  in  our  development  is  to  form  the  maximum-likelihood 
estimate.  This  is  done  by  maximuzing  (2.5)  on  the  informative 

sample: 

w 

0  =*  arg  max  Bq(0,  0g) 

0 

where 
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Bq(0>  G0)  =  l  Pqi(0o)  lo§ 
i=l 


Pqi(0) 

Pqi(0O> 


(2. 1C) 


Thus 


l  Pqi(0o) 
i=l 


3lo jpqi(0) 
30 


=  0 


(2.11) 


We  note  here  that  an  approximation  for  Bq(0i<  ©g)  is 


Bq(0,  0O) 


n 

q 


l 


i=l 


p(xi  j  0) 
l0g  p(xi  0o) 


(2.12) 


and  the  two  expressions  are  asymptotically  equal  as  nq  ®.  This  form  was 
used  by  Akaike  (1973)  to  derive  the  AIC  criterion. 


Solving  (2.12)  would,  in  principal,  give  the  maximum-likelihood  esti¬ 
mate  if  the  dimension  of  0  were  known.  However,  in  practice,  the  actual 
dimension,  m,  of  9  is  not  known.  Furthermore,  there  is  an  obvious  tradeoff 

A 

between  the  dimension  of  our  estimate  0  and  prediction  error.  Assume 

0  e  Rk.  Then  as  we  increase  k,  the  fit  error  on  the  informative  sample 
will  decrease  momotonically.  However,  at  some  point  we  are  in  danger  of 

A 

overfitting  the  model  so  that  0  is  a  function  of  the  sampling  error  on  the 
informative  sample.  When  this  happens,  the  fit  errors  on  the  predictive 
sample  will  begin  to  increase. 

If  we  assume  that  the  true  parameter  vector  dimension  is  m  and  that  the 
estimated  parameter  dimension  is  k  <  m,  then  our  objective  is  to  evaluate 
the  information  measure  on  the  predictive  sample  and  select  the  model  which 
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maximizes  this  measure.  The  discrimination  measure  is  now  separated  into 
two  parts  in  order  to  simplify  the  analysis: 


~k 

Br(0,  0O) 


l 


i=l 


Pri(0O)log 


Pri(Qk) 

Pri<0O) 


?  /rt  x,  Pri(9  )  ?  x,  PriW 

=  L  Pri(Q0)l°g  - —>77  "  L  Pri(0o)1°g  - —77 

i=l  Pri(0k)  i=l  pri(0k) 


=  Br(0k,  0k)  _  Br(0o,  0k)  (2.13) 

where  0*°  e  R^-.  Both  entropy  measures  are  measured  with  respect  to  the  den¬ 
sity  pri(®k)  and  0^.  is  arbitrary.  We  will  in  the  sequel  pick  ©k  in  a  par¬ 
ticular  manner  which  clarifies  and  simplifies  the  development.  The 
decomposition  of  (2.13)  is  done  to  clarify  the  exposition  and  to  make  clear 
the  crucial  role  played  by  the  number  of  parameters  k.  The  summations  in 
(2.13)  are  taken  with  respect  to  the  true  density  on  the  predictive  sample 
while  Qk  is  the  estimate  computed  on  the  informative  sample.  Thus, 

Br(Qk,  eQ)  is  a  measure  of  the  information  between  the  estimated  den¬ 
sity  and  the  true  density  on  the  predictive  sample.  Since  the  informative 
sample  is  known  but  the  predictive  sample  is  not  we  will  use  statistical 
mean  values  in  the  sequel. 

In  order  to  evaluate  Br(0k,  ©k)  ancj  Br(0g ,  0k)  we  will  expand  around 
the  actual  probabilities  on  the  informative  sample. 

A 

Evaluation  of  Br(0k?  ©k) 
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From  (2.13): 


Br(©k,  0^)  =  I  pri(0o)  [log  pri(0k)  -  log  pri(0k)]  (2.14) 

i=l 

Define  Che  sampling  error  between  Che  informative  and  predictive  proba¬ 
bilities  as 


ei(0)  =  Pri(©)  -  pqi(0)  (2.15) 

Expanding  the  log  term  to  second  order  yields 

,  Slog  Pai(®^)  i 

log  prl(Qk)  =  log  pqi(0k)  +  - g|— -  ei(0k) 

1  2(0k)  +  (0k _  9k) 

2  3p  2  1  30k 

qi 

+  7  (ik  -  0k,T  --  'vS5--0---  <°k  -  ®k)  +  <°k  -  0k>T  ei(0k) 

qi 

(2.16) 

Thus 


Br(0k,  0k)  =  f  pri(0o)  31°fkPql-~  —  C®k  "  0k) 

i=l  30 

+  J  J 1  Pri(0o)  (@k  -  0k)T  3  (®k  '  0k) 

+  l  pri(90)  (9k  -  9k)T  9  Pq'V—  «i(0k)  <2*17) 

i=l  30K  9pqi 

This  expression  can  be  further  simplified  by  utilizing  the  fact  that, 
since  0k  is  a  maximum-likelihood  estimate  on  the  informative  sample: 
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0 


(2.18) 


3l0§  Pqi(Q  ) 
L  Pqi(9o)  — 
i=l 


30k 


Expanding  this  around  0k  yields 


l  Pqi(Q0) 


i=l 


3log  pqi(0k)  32log  pqi(9k)  .  ' 

—  +  — - <0k  -  0k> 


30k 


=  0 


(2.19) 


Using  (2.15)  and  (2.19)  and  in  (2.17)  yields 

Br(9k,  ®k)  =  I  ei(0o)  — g-,^l(0k)  (0k  _  0k) 

i=l  30k 

+  J  l  ei(®0)  (Qk  '  Qk)T  — -l0g  Pql(9  }  .  (0k  _  0k) 
i=l  30k2  ’ 

"  J  I  Pqi(Q0)  (®k  “  ®k)T  —  .i.°.^,.Pq1^0  ^  .  (0k  _  ©k) 
i=l  30k2 

+  [p,i<0o>  +  Pi(e0)!  <9*  -  9k)T ei(so)  (2.20) 

Pqi 

where  we  have  assumed  ei(9k)  =.  ej.(0o). 

The  error  e^OQ)  is  the  difference  of  two  probabilities,  which  are 
binomially  distributed,  by  construction: 

ei(0o)  =■  pri(0o)  -  pqi(9o) 
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Further-ore  [  e ^ =  I1,  by  definition. 

Since  we  are  assuming  here  chat  pri(0())  and  Pqi(0o)  are  independent  samples 
from  the  sane  underlying  distribution,  ej_(3Q)  is  unbiased: 

E  { e i ( 3 o ) ;  =  0  (2.21) 

where  E  {  }  denotes  expectation  with  respect  to  all  underlying  random 
variables.  Recalling  that  the  informative  sample  is  of  size  nq  and  the 
predictive  sample  is  of  size  nr,  Pqi(0(})  has  approximate  variance 

var  (pqi(Oo))  =  tr^  Pi(0o)  t1  ~  Pi(0O>] 

nq 

and  Pri(©o)  has  variance 

var  (pri(0o))  =  Pi(0o)  [1  “  Pi(0o)l 
nr 

Thus 

var  (eiCGQ))  =  -3  Pi(0o)  U  ~  Pi(0o)J  (2.22) 

n 

where  n  =  nqnr  /  (nq+nr).  The  expected  value  of  Br(0k,  0k)  can  now  be 
written  in  simplified  form  by  using 


32log  Pqi(0k)  ^  1  3log  Pj(9k) 

30k  3p  .  Pi(0O>  30k 

qi 
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s 

Furthermore  £  e^Og)  =  0,  by  definition. 
i=l 

Since  we  are  assuming  here  that  pri(0g)  and  Pqi(9g)  are  independent  samples 
from  the  same  underlying  distribution,  e^(0g)  is  unbiased: 

E  (eiCOo) }  =  0  (2.21) 

where  E  {  }  denotes  expectation  with  respect  to  all  underlying  random 
variables.  Recalling  that  the  informative  sample  is  of  size  nq  and  the 
predictive  sample  is  of  size  nr,  pqi(Qg)  has  approximate  variance 

var  (pqi(©o))  =  — ^  Pi(Qo)  t1  “  Pi(Qo)l 

nq 

and  pri(Qg)  has  variance 

var  (pri(©o) )  =  — -  Pi(©o)  U  ~  Pi(9o)I 
“r 

Thus 


var  (e^(0g))  =  Pi(Og)  [1  -  Pi.(9g)]  (2.22) 

n 

a 

where  n  =  nqnr  /  (nq+nr).  The  expected  value  of  Br(0k,  9'c)  can  now  be 
written  in  simplified  form  by  using 


32log  Pqj(0k) 


30k  ap 


qf 


1  3log  Pi(0k) 

Pi(®0)  30k 
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The  result  is  that  the  expected  value  of  Br(0^,  ©k)  is 


E  {Br(0k,  0k)| 


«  -  j  E  |  ^  Pi(90)  (®k  -  ®k)T  -3q^-  — <®k 


~  l  E  I (9k  -  0k)T 

n  i=l  l 


3log  pi(@  ) 
30k 


(2.23) 


In  the  sequel  we  will  choose  0k  =  0*k  so  that  0*k  is  a  minimum-variance 
estimate  of  0q.  This  results  in  the  second  term  being  much  smaller  than 

the  first  term  for  reasonably  large  values  of  rf/s.  We  will  explicitly 
neglect  this  term  in  the  sequel. 


Evaluation  of  Br(0Q,  0*k) 


From  (2.23) 


Br(0O>  Qk)  = 


“  7  l  Pi(0o)  (e0 

i=l 


3  log  Pi(0q) 


"  y  (Q0  -  0k)T  K®0)  (®0  -  0k> 


(0O  -  0k)  (2.24) 


(2.25) 


where  I(0q)  is  the  information  matrix 


2- 


(2.26) 


UQq)  = 


s 

r 

L. 


i=  1 


Pi(0O> 


34log  piCOq) 


In  (2.23)  both  0k  and  ©k  are  k-dimensional  parameter  vectors.  Here, 
however,  0g  is  an  nr-dimens ional  vector  (m  >  k).  To  handle  this  situation 
we  write  0q  -  0k  e  Rm  as 


0O  _  Qk  = 


0gk  -  0k 
_  °0 


where  9gk  s  R^,  0q  e  Rm-k 
setting 


J(0k)  =  I  (0O  -  0k)T  I(0O)  0O  -  0k) 
and  minimizing  with  respect  to  0k  yields 

0*k  .  0Qk  _  ijj-1  (0q)  i12  (0q)  0O 


where  we  have  partitioned  I(9q)  as 


I(0O)  * 


~  Iu(0o) 

T 

_  Il2(Q0) 


Il2(0O) 
I22(0O)  _ 


(2.27) 


The  minimum  value  of  J  is 


J(0*k)  .  |  i0T  [ I22(0o)  -  Ii2(Qo)T  xll(0o)~L  Il2(0O)l  0O 
If  we  particion  the  covariance  matrix 

P(®0)  =  K0o)_1 

pll(0o)  p12(0o) 

T 

L  P12(0O)  P22(0O)_, 

then 

J(0*k)  =  ^  0OT  P22(Qo)~1  0O 

where  P22(0o)-1  =  x22  ~  X12T  Ill_1  x12 
Since 

s 

P(0O)  -  I  pi(0)  (0o  -  0*k)  (0o  -  0*k>T 
i-1 

-  E  [(0q  -  0*k)  (0O  -  0*k)T] 
we  get,  finally, 

E  [ J(0*k) ]  -  |  (m  -  k) 

or 

I 


E  tBr(0O.  0*k)l  -  i  Ot-m) 


2.4  Unbiased  Estimate  of  Entropy 


From  (2.23) 


E  jBr(0k,  Qk)  j.  =  _  ^  (Qk  _  0k)T  i(0k)  (Qk  _  ©k) 
where  l(0k)  is  the  kxk  information  matrix 


l(0k)  =  E  J 


32logpi(0k) 


J,  Pi(0O>  301=2 


and  Pi(©o)  is  given  in  (2.6).  Using  (1.5)  and  (1.6)  we  see  that 


E  |Br(0k,  0k)  |  -  -  |  tr  Ik 


k 

7 


where  Ik  is  the  kxk  identity  matrix. 


Combining  (2.29)  and  (2.28)  yields 


(2.29) 


E  |Br(Qk,  0O)|  -  7  -  k  (2.30) 

A  A 

where  0k  is  the  maximum  likelihood  estimate  (0k  e  R.k).  This  represents  a 
bias  in  the  maximized  log-likelihood  function,  with  the  result  that  our 
goal  is  to  pick  k  such  that 


2- 


I  Pqi  (Q0)  loS  Pqi  (®k)  +  T  “  k 
i=l 

is  maximized.  By  reference  to  (2.10)  and  (2.12),  this  is  equivalant  asymp¬ 
totically  to  picking  k  such  that 

n 

q 

I  log  p(xi  j  Qk)  +  y  -  k 
i-1 

is  maximized.  Since  m  is  a  constant  here,  the  equivalent  goal  is  to  mini¬ 
mize 

n 

AIC(k)  -  -  2  l  log  p(xi  j  ©k)  +  2  k  (2.31) 

i=l 

with  respect  to  k,  which  is  Akaike's  AIC  criterion. 


CHAPTER  3 

3.  CANONICAL  VARIATES  ANALYSIS 

We  now  consider  Che  linear  prediction  problem  using  Che  canonical 
variates  analysis  approach. 


Let  the  past  be  represented  as  a  column  vector  P(t)  defined  by 


P(t) 


y(t) 

y(t-l) 


|_  .  nxl 

and  define  the  future  as  a  column  vector 


F(t) 


Y(t+1) 

y(t+2) 


m  <  n 


mxl 

where  y(t)  is  the  r-dimensional  observed  output  at  time  t.  Our  goal  is  to 
predict  the  future  F(t)  given  P(t). 


We  now  consider  the  canonical  variate  analysis  in  a  form  that  allows  us 
to  explicitly  show  the  optimality  properties  of  the  method. 

Consider  nonsingular  transformations  of  the  past  and  future 


c(t) 

*  J 

P(t) 

(3.1) 

nxl 

nxn 

nxl 

d(t) 

-  L 

F(t) 

(3.2) 

mxl 

mxm 

mxl 
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and  form  a  k—  order  estimate  of  F(t) 

„  k 

Fk(t)  =  l  at  Cl(t)  (3.3) 

i=l 

where  {a-j_}  are  mxl  vectors  and  c^  is  the  i—  component  of  c  (a  scalar). 
Since  a^  is  fixed,  only  c^Ct)  depends  on  the  data.  Since  J  is  only 
constrained  to  be  nonsingular,  we  can  use  a  very  general  form  for  it. 
Without  loss  of  generality  we  can  specify  that 

E  [c(t)  c(t)T]  =  Inxn  (3.4) 

Let  B  be  an  orthonormal  matrix: 

T 

®nxn  Bnxn  =  ^nxn  (3.5) 

Then 

J  Spp  jT  -  BT  B  (3.6) 

where  Spp  =  E  [P(t)  P(t)T] 

This  has  a  solution 

T  -1/2 

J  -  B  Spp  (3.7) 

Now 

T 

ci  =  p(t)  (3.8) 

T 

t*  h 

where  Jj[  is  the  i—  row  of  J; 
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T  T  -1/2 

Ji  =  bi  Spp 

and 


B  =  [bj  b2  •  •  •  bn] 

nxn 


Thus 


T  -1/2 

ci  =  bi  Spp  P(c) 
and  the  estimate  ?k(t)  is 


K(t) 


H  T 

i  ai  bj. 
i=l 


-1/2 

Spp  P(t) 


J  Qk  Spp72  P(t) 


where 


Qk  =  l  ai  bi 

i=l 


Note  that  Qk  has  maximum  rank  k. 


The  prediction  error  is 


-1/2 

ek(t)  =  Qk  Spp  P(t)  -  F(t) 

We  now  form  a  quadratic  cost  function 


(3.9) 


(3.10) 


(3.11) 


(3.12) 


(3.13) 


(3.14) 
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W  =  E  [ek(t)  W  ek(t)] 


-1  ,  -1/2  T  -1/2  T  T 

=  tr  W  E  { [Qk  Spp  P(t)  -  F( t ) ]  [P(t)  Spp  Qk  -  F(t)]} 

-1  T 
-  tr(W  Qk  Qk) 

-1  -1/2  -1 
-  2tr(W  Qk  Spp  Spf )  +  tr(W  Sff) 


where  Spf  =  E  [P(t)  F(t)T] ,  Sff  =  E  [F(t)  F(t)T] 

In  order  to  handle  the  orthonormality  constraints  we  add  the 
equations  via  Lagrange  multipliers  to  form  the  augmented  cost 


k  T 

Lk  =  Lk  +  l  x±  (bi  bi  -1) 
i=l 

where  {if}  are  Lagrange  multipliers.  Thus 


_  _1  k  x  n  T 

Lk  =  tr  (w  l  a±  bt  l  bj  a j } 
i=l  j=l 

-1  k  X  -1/2 

-  2  tr  {W  I  31  b£  Spp  spf} 

1=1 


+  tr  {W  Sff} 


T 

Using  bi  bj  =  6ij>  with  6  the  Kroneker  delta  function, 
and  rearranging  gives 


3-4 


(3.15) 


constraint 


(3.16) 


(3.17) 


(3.18) 


I 

ai 

-1/2  _1 

Spp  Spf  W 

v  T 

+  l  Ai(bi  bi  -1)  (3.19) 

i=l 

Taking  partial  derivatives: 

3Lk  T  -1  T  -1/2  -1 

=  2  H  W  -  2  bi  Spp  Spf  W  (3.20) 

3Lk  T  -1  T  -1/2  T 

-  -2  8l  W  Spf  Spp  +  2  Af  bf  (3.21) 

Thus,  the  first  order  necessary  conditions  for  minimizing  Lk  are 


Lk  = 


£  T  -1 

l  H  w 

i=l 


t  T 
-2  l  bi 

i=l 


-1 

+  Cr(W  Sff) 


*  T  -1/2  * 

ai  =  spf  spp  bi 


(3.22) 


*  -1/2  -1  * 
Af  bi  =  Spp  Spf  W  af 

for  i  =  1 1  2,  ...,  k. 


> 


(3.23) 


* 

Eliminating  ai: 


*  -1/2  -1  T  -1/2  * 

Ai  bi  =  Spp  Spf  W  Spf  Spp  bf 

which  is  an  eigenequation. 


(3.24) 


3-5 


The  first  term  of  (3.19)  becomes 


£  *T  -1  * 

L  ai  w  ai 
i=l 


£  *T  -1/2  -1  T  -1/2  * 

L  bf  Spp  Spf  W  Spf  Spp  bf 


k 

=  l 

i=l 


The  second  term  of  (3.19)  becomes 


k  *T  -1/2  -1  T  -1/2  * 

-2  l  bf  Spp  Spf  W  Spf  Spp  bf 
i=  1 


k 

=  -2  I  Xf 
i=l 


Thus,  the  optimized  cost  is 


*  _1  k. 

Lk  =  tr(W  Sff)  -  l  Xf 

i=l 


Now  let 


-1/2  -1/2 

R  =  Spp  Spf  W  (nxm)  n>m 


From  (3.24), 


(3.25) 


(3.26) 


(3.27) 


(3.28) 


*T  T  * 

bf  R  R  bf  =  Xf  (3.29) 

By  using  a  singular  value  decomposition  on  R: 
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(3.30) 


where  Yi  >  Y2  >  •  •  • 

T  T  T  T 

Then  R  R  =UDV  VD  U 

T  T 

=  UDD  U  (3.33) 

Then,  from  (3.29) 


*T  T  T  *  ,/N 

bi  U  D  D  U  bi  =  (3.34) 

Thus  bj[*  is  the  eigenvector  of  U  D  D1  UT  whose  eigenvalue  is  X^. 

Now  let 

U  =  [Di  U2  .  .  .  UnJ  (3.35) 

where  the  are  mutually  orthogonal  unit  vectors  by  construction.  But  the 
matrix  U  D  DT  has  eigenvectors  and  associated  eigenvectors  Y ±z  since 

UtT  U  D  Dt  Ut  Uj  =  Yi2  «ij  (3.36) 

Thus 
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ai  —  Spf  Spp  U±  (3.3i 

By  using  (3.37)  in  (3.27)  we  get 

*  -1  ~  2 

Lk  =  tr(W  Sff)  -  l  Yi  (3.3' 

i=l 

and  we  see  that  the  cost  is  minimized  by  using  the  k  largest  canonical 
variances,  Y].2  >  Y22  >  Y32  >  .  .  >  Yk2* 

We  can  now  write  the  optimal  forecast  as 


*  k  *  * 

Fk(t)  «  [  a-t  ci(t) 

i=l 

£  T  -1/2  T  -1/2 

-  I  Spf  Spp  Ui  Ui  Spp  P(t) 

T  -1/2  £  T  -1/2 

=  S  f  Spp  (  l  ^  Ui)  Spp  P(t)  (3.4 

i-1 


* 

Thus,  if  we  denote  the  optimal  weighting  matrix  by  A^: 


Fk(t)  =  Ak  P(t)  (3.4 


*  T  -1/2 

^k  ”  Spf  Spp 


V  T 

l  Ui  Ui 

i=l 


(3.4 


Note  that 


(3.43) 


*  T 

-  Spf  Spp 

To  decermine  L  (cf  (3.2)),  we  can  use  the  condition 

E  (cdT)  =  D  (3.44) 

or 

J*Spf  LT  -  D  (3.45) 

From  (3.7), 

T  -1/2  T 

uT  SPP  Spf  lT  =  D  (3.46) 

But 


D  =  UT  R  V 

„  -1/2  -1/2 

=  UT  Spp  Spf  W  V  (3.47) 

Comparing  (3.46)  and  (3.47)  gives 


T  -1/2 

Lt  .  W  V  ,  or 


L  =  VT  W 


-1/2 


(3.48) 


* 

Note  that  A^,  the  optimal  gain  matrix  is  of  dimension  mxn  but  has  a  maximum 
rank  of  k. 


Note  that  k  <  m  since  the  symmetric  matrix  in  the  eigenequation 
(3.24)  has  rank  <  m.  This  is  very  important,  as  it  implies  that  we  need  to 
make  the  dimension  of  the  future  vector  (m)  at  least  as  large  as  the  maxi¬ 
mum  expected  order  of  the  estimator. 


An  efficient  computation  of  A^  is 


*  -1/2  -1/2 
di  =  spp  ui  I  spp  symmetric 

*  T  * 
ai  =  spf  di 

*  k  *  *  x 

Ak  =  l  ai  di 
i=l 


Cholesky  Form 


The  cholesky  factorization  of  a  positive-definite  matrix  is  an  attrac¬ 
tive  way  of  computing  a  square  root  matrix.  Let 


i  T 

Spp'1  =  Q  Q 

Then  we  get  the  following  relations 


*  T 

ai  =  spf  Q  ui 

*  T  -1  T 

*i  =  Q  Spf  W  Spf  Q  Ut 


T  -1/2 

R  -  Q  Spf  w 


Fk(t)  -  Spf  Q  (  l  Uf  Ui)  QT  P(t) 
i=l 


*  x  k  T  T 

Ak  -  Spf  Q  (  l  Uf  Ui)  Q 
i=l 
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Trucated  Predictor 


1  In  the  sequel,  we  will  be  restricting  the  total  number  of  parameters 

allowed  in  the  predictor.  The  question  arises  as  how  to  best  truncate  the 
prediction  equations.  Our  approach  is  to  use  only  the  most  recent  past 
values.  For  example,  suppose  we  have  used  m  =  5  in  our  analysis,  but  wish 

i 

only  to  use  a  one-step- ahead  predictor  with  k  parameters.  Then  our  predic- 

i  ie 

tor  uses  only  the  first  k  elements  of  the  first  row  of  A^. 

Inclusion  of  Known  Inputs 

If  we  have  an  unknown  system  with  measured  outputs  y(t)  and  measured 
inputs  u(t),  the  analysis  of  this  section  holds  with  only  slight  modifica¬ 
tions.  If  we  augment  the  past  vector  as 


P(t)  = 


y(t) 

u(t) 

y(t-l) 

u(t-l) 


(3.49) 


then  all  of  the  analyes  of  this  section  holds  and  the  predicted  values  of 
y(t)  depend  on  both  past  values  of  y(t)  and  on  past  values  u(t). 
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CHAPTER  4 


STATE  SPACE  MODELS  USING  CVA-AIC  TECHNIQUE 


We  now  consider  che  problem  of  deceraining  Che  scace-space  nacrices 
direccly  from  che  linear  prediccion  solucion.  Recall  from  Seccion  3: 


P(t) 


y(c) 

y(t-l) 


(4.1) 


nxl 


F(C)  =  A  P(c)  (4.2) 

If  we  rescricc  our  problem  Co  a  one-s cep-ahead  prediccion  of  y(c),  Chen 

y(t+l  |  c)  =  A  P(c)  (4.3) 

where  A  is  mxn. 

We  can  wrice  a  recursion  for  P(c)  as  follows: 

P(t-fl)  -  M  P(t)  +  T  y(c+l)  (4.4) 

where 


(4.5) 


where  all  submacrices  are  mxm. 


M 


0  0 . 0 

I  0 . 0 

0  I  0 . 0 

•  •  « 

•  •  • 

•  • 

0 . 0  10 


(4.6) 

so  chac  P(c)  is  in  recursive  fora  with  a  driving  Cera  y(t+l). 

The  state-space  formulation  employs  a  Kalman  filter  for  updating.  The 
equations  for  the  time-invariant  filter  at  steady  state  are 


^mxm 


0(n-m)xm 


y(t+l  |  t)  =  H  x(t+l  |  t)  (4.7) 

x(t+l  j  t)  =  4  x(t  J  t)  (4.8) 

x(t+l  |  c+1)  =  x(t+l  |  t)  +  K  [y(t+l)  -y(t+l  j  t)]  (4.9) 

Combining  (4.7)  -  (4.9)  yields 


- 

y(t+l  j  t)  =  H  $  x(t  )  t)  (4.10) 

x(t+l  |  t+1)  =  (I-KH)  *  x(t  |  t)  +  K  y(t+l)  (4.11) 

* 

as  the  state-space  equation  set.  The  linear  prediction  set  is 


7(t+l  I  t)  -  A  P(t)  (4.12) 

P(t+1)  =  M  P(t)  +  T  y(t+l)  (4. 13) 

*. 

What  we  seek  to  do  is  match  these  two  pairs  of  equations  by  finding  the 
state-space  matrices  H,  i,  K  which  give  the  best  "fit”  to  the  linear  pre¬ 
diction  equations. 


Equations  (4.7)  -  (4.9)  can  also  be  put  in  the  form 
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(4.14) 


y( t+1  j  c)  =  H  x( t+1  |  c) 

x(t+l  |  t)  =  4(1  -  KH)x(t  |  t-1)  +  4  K  y(c)  (4.15) 

We  can  solve  che  problem  operationally  by  using  solutions  involving  delay 
operators . 

We  will  first  solve  the  linear  prediction  equations.  From  (4.13), 

P(t+1)  =  zM  P(t+1)  +  T  y(t+l)  (4.16) 

where  z  is  the  delay  operator:  z  P(t+1)  =  P(t) 

Solving  (4.16): 

-1 

P(t+1)  =  (I-zM)  T  y(t+l)  (4.17) 

Thus,  from  (4.12): 

y(t+l  |  t)  -  A  (I-zM)  1  T  y(t)  (4.18) 

We  can  also  solve  the  state-space  equations  in  the  same  way. 

From  (4.10)  and  (4.11)  we  get 

i  “1 

x(t  |  t)  -  [I  -  (I-KH)  4z]  K  y(t)  (4.19) 

so  that 

y(t+l  |  t)  -  H  4  [I  -  (I-KH)  4z]_1  K  y(t)  (4.20) 

while  from  (4.14)  and  (4.15)  we  get 
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x(  t-K2  c-Hl)  =  (I  -  4 (I-KH)z ]  i  K  y(t+l) 


(4.21) 


.  -1 

y( t+1  |  c)  =  H  (I  -  $(I-KH)z]  i  K  y(t)  (4.22) 

If  we  could  gee  a  perfect  match  between  the  linear  prediction  and  the 
state-space  predictors,  then  the  following  equation  would  be  satisfied 

-1  -I 

A  (I-Mz)  T  =  H  <f  [I-(I-fCH)<tz]  K  (4.23) 

or,  equivalently 


A  (I-Mz)  1  T  =  H[I-$(I-KH)z ]  1  *  K  (4.24) 

Using  (4.23),  we  see  that  exact  matching  occurs,  for  dim  (x)  =  n,  if 


A  =  H  $ 

M  =  (I-KH)  4> 

T  =  K 

Equation  (4.26)  can  be  written  as 


(4.25) 

(4.26) 

(4.27) 


M  -  *  -  TA, 
so  that 


(4.28) 


=>  M  +  TA  (4.29) 

In  addition,  (4.25)  gives 

H  -  A  *  _1  (4.30) 

Since  H  must  satisfy  A  =  H  (M  +  TA) ,  it  is  easily  shown  that  H  is  In  cano¬ 


nical  form 


H  =  f*mxm  ^ax(n-ni)] 

If  5  is  a  valid  cransicion  matrix,  it  is  guaranteed  to  be  invertible.  Thus 
we  only  need  to  guarantee  the  invertibility  of  M  +  TA.  Using  the  defini¬ 
tion  of  M  and  partitioning  T  and  A  appropriately,  we  see 


nxn 


^mx(n-m)  ^mxm 

^(n-m)x(n-m)  0(n-m)xm 


J-mxm 

0(n-m)xm 


AI  a2 

_I  0  _ 

The  inverse  is 


[ 


Almx(n-m)  A2mxm 


1 


(4.31) 


0 

I 

“ 

- 

1 

$ 

= 

-1 

-1 

L  a2 

-a2 

AlJ 

Thus  * 

^  exists  if  A2 

-1 

exists. 

To  ci 

partitioned  form  as 

it 

-S’ 

< 

■  » 

T 

T 

"  *11 

*12_ 

spfl 

^P^^pxp 

T 

_ 

L  w12 

w22_ 

(4.32) 

To  check  this,  write  equation  (4.15)  in 


(4.33) 


where 
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W11  «12 
_  w12  '"21. 

Then 


T  T 

a2  =  sp£l  w12  +  spf2  w22 
Now  partition  Spp  as 


S11  s12 


S 


PP  ~ 


s22  J 


Then 


-1  T  -1  -1 

w12  =  "  sil  s12  (s22  ~  s12  S11  s12) 

T  -1  -1 

w22  =  (s22  '  s12  S11  s12) 


(4.34) 


(4.35) 


(4.36) 


(4.37) 

(4.38) 


so  that 


A2 

Therefore 


T  T  -1 

(spf2  “  spfl  S11  s12)  (s22 


T  -1  -1 

Sl2  S11  s12> 


(4.39) 


-1  T  -1  T  T  -1  -1 

a2  “  (s22  ~  s12  S11  s12)  (spf2  “  spfl  S11  s12) 

Solving  for  yields 


(4.40) 


T  T  T 

A1  “  spfl  W11  +  Spf 2  W12 

Using 


(4.41) 
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-1 

S11 


(4.42) 


W 


11  = 


-1  T  -1 

+  ^11  ^12  (^22  ~  Sj?  S^i 


-1  T  -1 

s12>  S1 2  SU 


we  gee 


T  -1 

A1  =  spfl  S11  + 

T  -1  T  T  -1  -1  T  -1 

[spfl  Su  S12  -  Spf 2 ]  (S22  -  S12  Sn  S12)  S12  SU  (4.43) 

In  summary,  we  can  use  the  direct  solution  if  the  matrix 


T  T  -1 

sce  *  spf2  ~  spfl  SL1  S12  (4.44) 

is  invertible. 

This  exact  solution  is  restricted  to  the  case  dim  (x)  =  n.  This 
implies  that 

dim(x(t))  =  dim(P(t)) 

Truncated  Filter 

Given  the  matrices  for  the  full-order  Kalman  filter  $nxn,  Hmxn»  Knxm» 
the  question  arises  as  to  whether  there  is  a  suitable  truncation  to  a 
lower-oder  form  required  to  meet  restrictions  on  the  total  number  of  para¬ 
meters.  Using  the  forms  of  (4.27),  (4.29)  and  (4.30)  and  assuming  an  order 
k  <  n,  and  prediction  of  the  first  p  future  values  (p  <  m),  we  truncate  as 
follows: 


(1) 

*kxk 

is 

the 

upper 

left 

kxk 

subraatrix 

of 

A 

vnxn 

(2) 

^pxk 

is 

the 

upper 

left 

pxk 

submatrix 

of 

^mxn 

(3) 

^kxp 

is 

the 

upper 

left 

kxp 

submatrix 

of 

^nxm 

CHAPTER  5 


CONFIDENCE  BAND  AND  ACHIEVABLE  IN  SPECTRAL  ESTIMATION 

5.1  Spectral  Estimation  Problem 

The  problem  of  determining  the  statistical  accuracy  in  identifying  a  model  for  a 

stationary  multiple  time  series  is  considered  in  this  chapter.  The  cases  of  the  presence  or 

absence  of  an  exogenous  input  or  additive  measurement  noise  are  included.  Consider  the 

general  case  where  the  vector  x(t)  is  the  exogenous  input  and  the  vector  y(t)  is  the 

observed  endogenous  output  of  a  system  which  may  include  other  unknown  excitations  and 

measurement  noise.  Thus  consider  the  jointly  stationary  gaussian  vector  time  series  x(t) 

and  y(t),  t  =  ...,  with  power  cross— spectral  matrices  S  (u/,0),  S  (cj,0),  S  (uj,0) 

xx  xy  yy 

parameterized  by  0,  and  denote  the  power  cross— spectral  matrices  of  the  joint  vector 
(xT(t),yT(t))T  as  SM). 

Statistical  inference  is  considered  on  a  class  of  linear  Gaussian  processes 
parameterized  by  9.  Specifying  a  parametric  model  for  the  conditional  process  y(t),  t  <  s, 
given  x  (t),  t  <  s,  implies  a  causal  linear  model  of  the  form 

l 

y(t)  =  q(t)  +  £  h(t— r;  0)x(t)  =  q(t)  +  r(t) 
r=0 


where  h(t;0)  is  a  causal  linear  system  giving  the  response  r(t)  due  to  the  past  exogenous 

input  x(t)  and  where  q(t)  is  the  error  in  predicting  y(t)  by  r(t).  From  linear  prediction 

theory,  the  transfer  function  of  h(t;0)  =  S  {u,0)S  and  the  error  q(t)  in  predicting 

y  x  xx 

y(t)  is  uncorrelated  with  r(t)  with  power  spectrum  S^(u\0)  =  S^^(uj,0)  - 

* 

H(u,0)Sxx(o;,tl)H  {uj,0).  Note  that  any  class  of  parameterized  models  S (u,9)  can  be 
equivalently  specified  by  the  parameterized  models  (S  (w,0),H(u;,0))  which  will  prove 

MM 
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more  convenient. 


It  is  convenient  to  work  entirely  in  the  frequency  domain  and  specify  the  probability 

distribution  and  likelihood  functions  in  terms  of  the  power  cross -spectral  density  matrices 

and  Fourier  coefficients.  For  simplicity  the  time  series  case  with  t  a  scalar  is  developed 

below,  however  the  results  generalize  easily  to  the  random  field  case  of  a  vector  t.  Then 

asymptotically  the  log  likelihood  function  is  given  following  Whittle  (1953)  and  Larimore 

(1977)  with  Q(cj)  =  Y(u/)  —  R(u)  and  using  the  relationship  E{Q(w)  =  Y(cj)  —  R(w)  and 

* 

using  the  relationship  E{Q(cj)X  (w)}  =  0  by 


N  Nx 

log  p(Xj^)  =  7)"log27r  2~J  [log|  SqqM|  +  Q*MSqq(w)]^ 


and  the  elements  of  the  gradient  vector  Slogp j  dO  and  Fisher  information  matrix  F(0)  are 


*  ,  dS  (u) 

V  .no-1/,  a  qqv  ' 


FiJ(»)  =  -E{ 


*  d  H(w 

—  X(w)Q  (w)Sqq(w) — -ijjj- 

^logp 

dO-dO. 
i  J 


dca 

2i 


r7T 


=  -t/  ‘'[s-Jm 


— 7T 


5  S nqM 

rr~ 


{s-i(w) 

.\  qqv  ). 


d  S  (w) 

t r 

j 


+  s  V) 

xxv 


d  H(w) 

TT. 

J 


}]_du; 
2  7T 


5.2  Simultaneous  Confidence  Bands 


Let  be  a  variable  such  as  frequency  or  time,  and  consider  a  p— dimensional 
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compiex  vector  f  (7 ,0)  with  components  that  are  functions  of  7  and  0  having  continuous 
second  derivatives  with  respect  to  the  parameters  0.  For  example,  the  elements  of  the 
vector  function  f(7 ,0)  could  be  the  elements  of  the  spectral  matrix  S,  the  squared 
magnitude  coherencies,  the  impulse  response  functions  of  a  spectral  factor,  or  the 
covariance  functions  of  the  process.  Asymptotically 

f(7,JO-f(7,0)  =  y7 ,0)0-0) 

T 

where  ^(7 ,0)  denotes  the  matrix  of  partials  di  ( y,0)/d0  evaluated  at  0—0.  This 
expansion  and  the  Scheffe  method  (Scheffe,  1953,  1959,  p.  68—70)  of  simultaneous 
confidence  intervals  as  applied  in  Newton  &  Pagano  (1984)  lead  to  simultaneous  confidence 
bands  in  the  univariate  case.  For  multivariate  processes,  it  is  of  considerable  interest  to 
extend  these  results  to  simultaneous  confidence  bands  on  vector  and  matrix  functions  of 
the  parameters,  e.g.  the  spectral  matrix.  The  extension  that  we  will  consider  is  the 
quadratic  form 

{f(7>)  ~f(7,0)}*P(7){f(7,0)  -f(7,0)}  (2-1) 

which  will  be  bounded  as  a  function  of  7.  In  the  multivariate  case,  there  is  a  choice  to  be 
made  for  P.  For  reasons  of  invariance  and  to  obtain  an  equally  tight  confidence  bound  on 
any  linear  combination  of  f(7,0)  —  f(7 ,0),  P  is  naturally  chosen  as  the  inverse  of  the 
covariance  of  (2.1). 

In  the  sequel,  a  general  P  is  used  and  then  specialized  to  this  natural  choice.  The 
basic  mathematical  result  needed  for  such  an  extension  is  given  in  the  Appendix  and  is 
used  to  prove  the  following  theorem  on  simultaneous  confidence  intervals. 
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Theorem  1.  Consider  a  parametric  family  of  stationary  Gaussian  vector  processes 
with  power  cross— spectral  density  matrices  S(7,0)  for  0e0  satisfying  regularity  conditions 
(Whittle,  (1953),  and  for  which  the  parameters  are  locally  identifiable  so  that  the  Fisher 
information  matrix  F  (0)  as  given  by  (1.1)  is  full  rank.  Let  y(l),  y(2),...,y(N)  be  a  sample 
realization  and  0  be  an  asymptotically  normal  and  efficient  estimator  of  0.  Let  P(7,0)  be  a 
Hermitian  Matrix.  Then  as  N  -»  ®,  the  probability  is  at  least  1  —  a  that  simultaneously  for 
all  7 eT  the  true  p— vector  function  f(y ,0)  is  bounded  by 

{f(7,0)  -  f( 7, 0) }  P(7,0){7,0)  -  f( 7, 0) } 

<x^q  ui^,o)Y-l\(o)i){rO)v\  (7,j) 

o 

where  q  is  the  dimension  of  the  vector  0  and  where  X_  is  the  upper  a  critical  point  of  the 

dLdLjQ 

chisquared  distribution  on  q  degrees  of  freedom. 


Proof:  As  shown  by  Rothenberg  (1971),  the  parameters  are  locally  identifiable  if 
and  only  if  the  Fisher  information  is  full  rank.  Let  f(7)  and  f(7)  denote  f  (7,0)  evaluated  at 
0  and  0  respectively.  The  vector  random  variable  N*^{f(7)  —  f(7)}  is  asymptotically 
distributed  as  the  normal  random  vector  N^^f^(7,0)(0  —  0).  Asymptotically  (0  — 

T  2 

0)  F(0)(0—  0)  is  a  X  random  variable,  wheu  F(0)  is  proportional  to  sample  size  N  as  in 

a,q 

rp 

(1.1).  So  the  probability  is  1  —  a  that  the  true  0  satisfies  (0—  0)  M(0-  0)  <  1  where  M  = 

*  2  o 

F(0)/X  .  From  the  Appendix,  this  inequality  is  satisfied  if  and  only  if  ||H(0—  0))| 

a,q 


<trHM—1H  for  all  pxq  —dimensional  matrices  H.  Since  the  set  {H  =  P1/“(7,0)f^(7,0)  for 
72T}  is  possibly  a  proper  subset  of  all  pxq  —dimensional  matrices  H,  it  follows  that 
asymptotically  with  probability  at  least  1  -  a  the  inequality. 


»pl/2 1 


.\{f(7)-f(7)}  P(7W(7)-f(7)} 
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(2.3) 


=N{ty7,W-  0)} 

<<  NX^q  tr  ^7,9)  F-‘(9)  f*9(7,9)P(7,9) 
is  satisfied  simultaneously  for  all  75  T. 

For  the  natural  choice  of  P  =  {{q(i,0)  F  (0){q(^,0)} 1 ,  using  f  to  denote  the  pseudo 
inverse  of  the  covariance  of  (2.2),  the  inequality  (2.3)  becomes 

(f(7)  -  f(7)}*{f(/7,9)  -  f  (t)}  <i X^q 

where  r  =  Rank  (P) 

—1  ~  2 

The  relative  squared  spectral  error  tr[S  (u>){S(w)  -  S(u)}]  is  a  fundamental 
quantity  in  measuring  the  accuracy  of  a  spectral  estimation  procedure.  The  integral  of  this 
quantity  is  asymptotically  the  Kullback— Lei  bier  information  of  negative  entropy 
(Larimore,  1983)  which  is  a  fundamental  statistical  measure  of  model  approximation  error. 
The  expected  value  of  the  integral  is  proportional  to  the  number  of  estimated  parameters 
divided  by  the  sample  size  (Larimore,  1982).  From  Theorem  1,  simultaneous  confidence 
bands  on  the  sample  relative  squared  spectral  error  are  given  by  the  following  theorem. 

Theorem  2.  Under  the  conditions  of  Theorem  1,  as  N-<  ®  ,  the  probability  is  at  least 
1  —  a  that  simultaneously  for  all  ueCl  the  sample  squared  relative  spectral  error  is  bounded 
as 

tr[S  1(c^){S(o/)  -  S ( a^) } ] 2 
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<X^  tr  qS-V.WS^-SM}]2 

a’q  k  ,1 

I 

where  {gk  /0)}  =  G  =  F ~\H). 

Proof:  Asymptotically  S (cj)  and  S(ui)  are  equal  so  that  we  may  consider  its  inverse 
in  (2.3)  a  constant  denoted  5>(w).  To  apply  Theorem  1,  we  consider  the  Hermitian  matrix 
A (cj)  =  S  ^  {S(w)  —  S(w)JS  ^2(w)  and  express  the  squared  relative  error 
symmetrically  as 

tr[S—1(u;){S(a;)  —  S(cj)}]2  =  tr{B_1/2(w)  JS-1/2(o/)]2 


^  £  * 

=  tr  AA  =  tr  AA  =  E  a-,  a.  ■  =  f  (w)f(w) 


where  f  (u)  =  vecA(cj)  is  a  vector  containing  the  elements  of  the  matrix  A(w).  Application 
of  Theorem  1  to  the  vector  function  f  (w)  and  rearrangement  as  in  (2.4)  proves  the 
inequality.  Expanding  S (u,0)  as  in  (2.4),  the  equality  follows  from 

E  tr[S_1(w){S(w)  -  S  (u;)}2 

1  dS(ui,0)  ~  -  T  «  <9S(w,0) 

=tr  ES-1(u j,0) - E (O-OMO-O)1  S ~\u,0) - ™ - 

k,j  dOv  dUl 

"  *  rr  _i 

and  using  E (0  —  0)(0  —  0)  =  F  from  the  asymptotic  efficiency  of  0 

In  principle  any  quadratic  form  in  the  components  of  the  spectral  matrix  could  be 
used  as  in  Theorem  1  by  introducing  a  weighting  matrix  P(cj,0).  For  confidence  intervals 
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on  the  spectral  matrix,  the  weighting  of  the  inverse  covariance  of  the  error  in  estimating 
the  spectral  matrix  gives  tightest  confidence  bands  which  can  be  expressed  as 


*  -  5vecS(u;)  *  dvec  S(w)  , 

vec  {S(u)  -  S(w)}  {£ - ^ - g k/0) - — - }Ivec{S(w)} 


M 


dO 


l 


For  a  given  confidence  level  a,  this  gives  a  simultaneous  confidence  band  for  all  frequencies 
ufl Cl  as  a  quadratic  form  in  the  elements  of  S(w)  —  S(u), 


5.3  Entropy  and  Spectral  Accuracy 


Consider  the  following  predictive  inference  setting  involving  an  observed 
T  T  T  T  T 

informative  sample  u  =(x  (l),y  (l),...,x  (N),y  (N))  of  size  N  used  to  estimate  the 
process  model,  and  similarly  consider  a  conceptual  predictive  sample  v  of  size  M  used  to 
evaluate  the  accuracy  of  the  estimated  model.  The  predictive  sample  is  assumed  to  be 
identically  distributed  by  independent  of  the  informative  sample.  Consider  the  problem  of 
inference  on  the  parametric  class  {p(v,0),0e0}  of  models  with  probability  densities  p(v,0) 
based  upon  the  informative  sample  u.  Consider  the  conceptual  repeated  sampling 
experiment  where  on  each  trial  the  samples  u  and  v  are  each  drawn  independently  from  the 
process  S(u,0+)  with  0*  assumed  to  be  the  true  parameter  value.  An  estimative  model 
p=p(v,0(u))  is  chosen  for  the  density  of  v  by  some  parameter  estimation  scheme  0(u).  The 
expected  negative  entropy,  also  known  as  the  expected  Kullback-Leibler  discrimination 
information  or  expected  I— divergence,  is  a  measure  of  the  error  in  approximating  the  true 
density  p+  of  v  by  the  estimate  p  and  is  given  by 


R(P*,P)  =  EuK(p*,p)  =  Eu  }  p(v,0*)  log 


P(v,0*) 

- x - dv 

p(v,0(u)) 


(3.1) 
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I 

where  Eu  denotes  expectation  relative  to  u  and  K  denotes  the  Kullback— Leibler 
information.  The  negative  entropy  measure  follows  as  the  natural  measure  in  the 
predictive  inference  setting  from  the  fundamental  principles  of  sufficiency  and  repeated 
sampling  (Larimore  (1983)).  This  approach  applies  to  very  general  modeling  methods  such 
as  nonparametric,  semi  parametric  or  parametric  procedures  as  well  as  methods  including 
!  decisions  on  model  structure  or  order  such  as  those  used  for  AR  and  ARMA  modeling. 

I 

,  Let  lower  case  variables  denote  a  sample  of  size  M  of  the  predictive  sample,  e.g. 

y=(y  (l),y  (2),...,y  (M))  and  qyy  denote  the  covariance  matrix  of  y.  By  expressing  the 
density  p(y,x;0)=p(y-r;0)p(x;0)  in  terms  of  the  conditional  random  process  q(t)=y(t)-r(t), 
the  log  likelihood  separates  with  the  density  of  x(t)  in  many  problems  not  a  function  of  the 
unknown  parameters  or  at  least  a  function  of  a  separate  set  of  parameters.  The 
I— divergence  (3.1)  thus  becomes 

PM*)  dx 
p(x,0*)  log  --  ■»-- 
p(x,<?) 

=  K(p*(q),p(q))  +  K(p*(x),  p(x))  (3.2) 

This  conditional  viewpoint  is  tken  in  the  following  where  only  the  first  term  of  the 
I-divergence  is  considered.  Inclusion  of  the  second  term  is  tantamount  to  modeling  the 
joint  vector  time  series  involving  the  two  series  x(t)  and  y(t)  jointly  rather  than  as 
exogeneous  and  endogeneous  respectively.  The  joint  case  is  included  as  a  special  case  of 
y(t)  a  vector  process  with  no  input  x(t)  which  will  be  discussed  as  a  particular  instance  of 
the  model  throughout  the  paper.  One  further  expression  for  the  I-divergence  will  be  very 
useful 


p(q.0*) 

K(p*p)  =  j  p(qA)  log - * - dq  +  J 

PM) 
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K(p*(q),p(q))  =  E  log  p(y-r,Eqq)  +  E  log  p(y-r,  Eqq) 

=  E  log  p(y-r,Eqq)  +  E  log  p((y-r)+(r-r),Eqq) 

=  E  log  p(y— r,Eqq)  +  E  lop  g((y-r),  +  E(r-r)TEqq(r-r) 
where  E  denotes  expectation  with  respect  to  the  density  p*. 

Let  S  denote  an  estimate  of  S.  We  will  need  to  assume  tht  S(w)  is  continuous  and 
that  Sqq(o;,0)  and  Sqq(w,  0)  are  positive  definite  for  u>e[— In  the  discussion,  the 
redictive  sample  v  will  be  considered  to  be  conditional  on  x(t)  and  to  have  an  infinite 
sample  size  M.  This  will  require  the  normaalization  of  the  negative  entropy  and 
I-divergence  by  the  sample  size  M.  The  I— divergence  per  sample  time  conditional  on  x(t), 
which  we  will  denote  I(S,S)  and  call  I-divergence  for  brevity,  can  be  expressed  using  (3.2) 
as  (Kazakos  &  Papantoni— Kazakos,  1980) 

I(S,S)  =  1  im— K(p(vM  0*),p(vM  0(uN)) 

Mho)M 

=  -  ^HlogiS^^S^fo,)  |  +  tr[I  -  | } -g 

-  )  -  HMISS^M  -  H(W)]*}-^  (3.4) 

where  the  subscript  emphasizes  that  the  sample  of  size  M  of  v  becomes  infinite.  The 
negative  entropy  per  sample,  or  negentropy  for  brevity,  is  defined  as  N(S,S)  = 
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1  imR(p*,p).  Note  that  the  I-divergence  is  composed  of  two  terms,  the  last  due  to  the 
Mil 

error  estimaating  the  transfer  function  H(w)  and  the  first  due  to  the  error  in  estimating  the 
spectrum  Sqq(w)  of  the  noise  q(t).  A  useful  approximation  for  the  first  term  in  (3.4)  is 


HIog|S(NMS(!q-1M|  +  tr[I-SqqMSqq-V)l}-5V 


1  J  MS^M[SqqM  -  Sqq(“)l  }2~Jl 


which  holds  to  second  order  in  the  elements  of  S  as  is  easily  shown  by  comparing  first 
and  second  derivatiives  of  the  integraands.  This  is  a  generalization  to  the  multivariate  case 
of  the  integral  of  the  squared  relaative  error.  Thus  the  I-divergence  is  approximately  a 
quadratic  form  in  the  estimation  errors  of  S^(af)  and  H  (w),  and  these  quadratic  forms  do 
not  intereact,  i.e.  there  are  no  cross  terms. 


5.4  Normalized  Spectral  Error  in  Principal  Components 

In  the  mulple  time  series  case,  the  spectral  measure  has  an  intuitive  interpretation 
in  terms  of  principal  components  of  the  power  spectrum  in  the  frequency  domain. 
Principle  component  representations  of  the  spectral  matricies  S^u)  and  Sqq(u;)  have  the 
form. 


JMSqqMJ*M  =  DM .  lms^Ml'm  =  EM  (4.1) 

where  J(u;)  and  L(u;)  given  as  a  functionof  frequency  u  are  unitary  matrix  transformatons 
*  * 

so  Jfu/IJ  (w)=I=L(u;)L  (u>)  which  diagonalize  and  Sqq(u>)  repectively  and  where 
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* 

where  G(u/)=J(u;)H(u;)L  (w)  is  the  transfer  function  H(w)  expressed  inthe  coordinate  frame 
of  the  principal  component  series  x(t)  and  y(t).  The  squared  magnitude  error  j  G-^(u)  - 
Gij(w)| 2  in  the  i,  j  element  of  the  transfer  function  is  weighted  by  the  input  signal  to 
output  noise  ratio  D^-E  -  -  for  the  pair  i,j. 

J  J 
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CHAPTER  6 


6.  DETECTION  OF  ABRUPT  MODEL  CHANGES 

In  this  section  we  apply  the  tools  developed  so  far  to  the  problem  of 
abrupt  change  detection.  We  then  present  some  experimental  results. 

6. 1  Algorithm  Development 

Consider  the  situation  depicted  in  Figure  6.1,  in  which  the  true  time- 
series  model  changes  from  Mg  to  Mj^  at  time  t— dj,  where  t  is  the  present 
time.  Suppose 


t-dg  t-d^ 


Figure  6.1  Changing  Time  Series  Model 

that  we  have  data  back  to  time  t-dg  and  that  the  true  model  is  Mo  in  the 
interval  (t-dg,  t~dj[). 

We  wish  to  detect  this  change  in  the  model.  In  this  example,  fitting  a 
single  model  to  data  over  the  interval  (t-dg,  t)  should  result  in  greater 
fit  errors  than  fitting  one  model  over  the  interval  (t— dg,  t— dj)  and 
another  model  over  the  interval  ( t-d i ,  t).  The  crucial  issue  is  to 
determine  an  appropriate  selection  measure  so  as  to  be  sensitive  to 
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changing  models,  while  at  the  same  time  not  being  too  sensitive  to  noise. 
Over-sensitivity  to  noise  will  result  in  deciding  that  model  changes  have 
occurred  when,  in  fact,  they  have  not.  Low  sensitivity  to  model  changes 
will  result  in  missing  changes  which  have  occurred  in  the  model.  Another 
obvious  problem  is  how  best  to  select  the  testing  intervals,  (t-dg,  t)  and 
(t-di,  t),  to  minimize  the  time  required  to  achieve  accurate  detection.  We 
consider  first  the  selection  measure. 

Over  the  interval  (t-do,  t)  we  can  find  the  model  which  minimizes  the 

AIC: 

d°  ~k 

AIC  (k)  =  -  2  l  log  p(e(t-d0  +  i)  |  0  )  +  2  M(k)  (6.1) 

i=l 

where  M(k)  is  the  number  of  independently  adjustable  parameters  and  where  we 
have  assumed  a  sampling  time  increment  of  one,  for  convenience. 

If  we  now  divide  the  interval  (t-dQ>  t)  into  two  subintervals, 

(t-do,  t-dj)  and  (t-di,  t) ,  we  determine  minimum  AIC  models  for  each  subin¬ 
terval 


AIC0(k) 


i=l 


,  -k 

log  p(e(t-dg  +  i)  |  9  )  +  2  M(k) 


(6.2) 


Aiq(k) 


2 


u  „k 

l  log  p(e( t-dg  +  i)  [  0  )  +  2  M(k) 

i=dg-dj+l 


(6.3) 


Now  assume  that 
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k*  =  arg  min  AIC(k) 
ko*  =  arg  min  AICg(k) 
ki*  =  arg  min  AlC^(k) 


[ 


and  lec  the  corresponding  models  be  parametrized  by 


«k* 

G 


* 


respectively. 

Then  the  model  selection  criterion  is  based  on  comparing  AIC(k*)  with 
AICg(kg*)  +  AIC(k^*)  and  selecting  the  model(s)  which  give  the  least  value. 
We  can  simplify  the  calculation  in  the  case  that  dg  >  >  d^  and  the  model 
does  not  change  too  much,  in  which  case  we  expect  that  k*  =*  kg*, 

0  *  this  case  we  can  define  the  AIC  difference  as 


AAIC*  =  AIC(k*)  -  AICg(kg*)  -  AIC^^*) 

?°  i  -k* 

=  -  2  l  log  p( e( t-dg  +  i)  |  0 

i=dg-di+l 

V°  '  kl*  * 

+2  l  log  p(e( t-dg  +  i)  j  ©i  )  -  2  M(k^  )  (6.4) 

i=dg-di+l 


and  the  decision  rule  is 


<  0  ;  declare  "no  change 

AAIC*  -I 

[^>  0  ;  declare  "change" 
Note  that  AAIC*  may  be  written  as 


(6.5) 


AAIC* 


u0 

2  l 

i=ldg-d]+l 


,  "  kl* 

p(e( t-dg+i)  0j  ) 

log  - - - 

p(e(t-dg+i)  j  Qk*  ) 


-  2  M(kt*) 


(6.6) 
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which  is  the  likelihood  ratio  in  favor  of  the  best  model  in  the  interval 
(t-dj,  t)  to  the  best  historical  model  evaluated  over  the  sane  interval, 
but  biased  off  by  the  number  of  parameters  of  the  best  model  in  the  inter¬ 
val  (t-dj,  t). 

If  we  specialize  this  result  to  the  linear  prediction  problem  under 
study  here,  we  see  that 

AAIC*  «  di  {log  |  S(k*)  |  +  tr  S(k*)  S(k*)  1 } 

-  dj  {log  |  SjCkj*)  j  +  m} 

-  2  mk i *  (6.7) 

where  S(k*)  is  the  theoretical  covariance  matrix  of  prediction  errors  for 
the  historical  model  fitted  on  the  interval  (t-do,  t-d^)  ,  S^Cki*)  is  the 
theoretical  covariance  matrix  of  prediction  errors  for  the  model  fitted  to 
the  data  on  the  interval  (t-d^,  t)  ,  and  S(k*)  is  the  actual  covariance 
matrix  of  prediction  errors  for  the  historical  model,  evaluated  on  the 
interval  (t-d^,  t).  Now  let  AS(k*)  =  S(k*)  -  S(k*). 

Then 


AAIC*  =  di  log  J 


S(k*) 

ovn 

*  *™ 

1. 

sl(ki* 

) 

1 

-  2  mkj 
di 


Thus  our  decision  parameter  is 


+  tr  AS(k*)S(k*)  | 

(6.8) 
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and  the  decision  rule  is 


{<  0  ;  declare  "no  change" 

(6.10) 

>  0  ;  declare  "change" 

6.2  Experimental  Results 

The  abrupt  change  detector  was  tried  on  a  changing  autoregressive 
model.  On  the  interval  t  £  [l,  dg ] ,  the  actual  model  was 

y(t)  =  1.65  y(t-l)  -  0.665  y(t-2)  +  u(t)  (Model  1)  (6.11) 

where  u(t)  was  zero-mean  white  Gaussian  noise  with  variance  of  1.  This 
model  has  two  real  stable  poles  at  0.95  and  0.7.  The  actual  model  was  then 
changed  to 

y(t)  =  2.5  y(t-l)  -  2.11  y(t-2)  +  0.595  y(t-3)  +  u(t)  (Model  2)  (6.12) 

on  the  interval  t  £  [do  +1,  dg  +  d,].  This  model  has  three  poles  at 
0.7,  0.9  +  0. 2i,  0.9  -  0.2i. 


The  first  trial  used  do  =  80,  d^  =  20.  The  resulting  covariance  matri¬ 
ces  on  the  interval  (1,  dg]  were 


11.0808 
10.9642 
10.7662 
10.5378 
. 10.2860 


10.9642 

11.0018 

10.8992 

10.7144 

10.4733 


10.7662 

10.8992 

10.9484 

10.8566 

10.6614 


10.5378 

10.7144 

10.8566 

10.9144 

10.8144 


10.2860 
10.4733 
10.6614  f 
10.8144 
10.8619  . 


'  11.0439 
10.8318 
10.5900 
10.3510 
_ 10.0978 


10.9089 

10.6534 

10.4015 

10.1607 

9.9061 


10.7124 
10.4500 
10.1993 
9 . 9542 
9 . 6859 


10.4822 

10.2258 

9.9753 

9.7122 

9.4296 


S 


ffl 


' 11.1612 
11.1214 
10.9673 
10.7430 
„ 10 . 4764 


11.1214 

11.2356 

11.1760 

10.9937 

10.7336 


10.9673 

11.1760 

11.2697 

11.1831 

10.9711 


10.7430 

10.9937 

11.1831 

11.2537 

11.1474 


By  performing  an  SVD,  we  obtain 


0.6655 
0.4239 
-  0.3888 
0.3596 
0.3112 

1 


0.4685 

-0.2456 

0.2135 

-0.1033 

-0.8148 


-0.4465 

0.7166 

0.3803 

-0.1198 

-0.3579 


0.3351 

0.4911 

-0.7320 

-0.3149 

-0.1072 


,2182 

0 

0 

0 

0 


0 

0.1863 

0 

0 

0 


0 

0 

0.1032 

0 

0 


0 

0 

0 

0.0207 

0 


10.2222  ’ 
9.9723 
9.7098 
9.4267 
9.1241 v 


10.4764 
10.7336 
10.9711  f 
11.1474 
11.2135  j 


-0.1607 

0.0726 

-0.3504 

0.8640 

-0.3157 


0 

0 

0  . 
0 

0.0165 
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0.4605 

-0.6945 

0 . 5094 

-0.1666 

0.1355 

0.4569 

-0.2407 

-0.4337 

0.4939 

-0.5489 

0.4493 

0.0706 

-0.5097 

-0.0174 

0.7301 

0.4398 

0.3337 

-0.0833 

-0.7382 

-0.3787 

0.4289 

0.5860 

0.5344 

0.4279 

0.0627 

The  resulting  values  of  AIC(k)  for  different  orders  k  are,  neglecting 


constants , 


AIC(l)  =  -  2.150 
AIC(2)  =  -  2.264 
AIC(3)  =  -  2.243 
AIC(4)  =  -  2.195 


so  that  k* 


2,  which  is  the  correct  order,  is  selected.  The  estimated 


model  is  y(t)  -  1.843  y(t-l)  -  1.0081  y(t-2). 


On  the  interval  [do  +  1,  dg  +  di ] ,  the  covariance  matrices  were 


1.7647 

1.7624 

1.6008 

1.3214 

1.1104 

1.7624 

1.9451 

1.9171 

1.7039 

1.4781 

1.6008 

1.9171 

2.0772 

2.0067 

1.8375 

1.3214 

1.7039 

2.0067 

2.1332 

2.0981 

1.3214 

1.7039 

2.0067 

2.1332 

2.0981 

1.6394 

1.4902 

1.4677 

1.6660 

2.1622 

1.4905 

1.2481 

1.1833 

1.3743 

1.8518 

1.2626 

1.0238 

1.0057 

1.2397 

1.7310 

0.9897 

0.8111 

0.8554 

1.1288 

1.6058 

0.8331 

0.7084 

0.7783 

1.0178 

1.4219 
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f  f  2 


1.7259 

1.7733 

1.9101 

2.2212 

2.7981 

1.7733 

2.1250 

2.5776 

3.1739 

3.9927 

1.9101 

2.5776 

3.4770 

4.5339 

5.7802 

2.2212 

3.1739 

4.5339 

6.1817 

8.0268 

2.7981 

3.9927 

5.7802 

8.0268 

10.5912 

Performing 

the  SVD 

yields. 

* 

0.9394 

-0.1382 

-0.2689 

0.0560 

-0.1513 

0.0073 

-0.8145 

0.4181 

0.3600 

0.1791 

u2  =  . 

0.1124 

-0.1507 

0.5108 

-0.7606 

-0.3538 

0.2936 

0.5410 

0.6957 

0.3074 

0.2065 

0.1361 

<. 

-0.0455 

-0.0891 

-0.4408 

0.8816 

r 

3.5041 

0 

0 

0 

0 

0 

0.6770 

0 

0 

0 

°2  =  < 

0 

0 

0.2473 

0 

0 

0 

0 

0 

0.0326 

0 

0 

0 

0 

0 

0.0094 

• 

0.3352 

-0.7554 

0.4483 

-0.2558 

0.2250 

0.3611 

-0.3926 

-0.3766 

0.6815 

-0.3305 

V  = 

0.4033 

-0.0185 

-0.6095 

-0.6546 

-0.1926 

0.4766 

0.2819 

-0.1621 

0.2039 

0.7909 

0.6062 

0.4422 

0.5093 

0.0107 

-0.4213 

The  resulting  values  of  AIC(k)  are,  again  neglecting  constants 


AIC(l)  =  -  0.960 
AIC( 2 )  =  -  2.266 
AIC( 3)  =  -  2.323 
AIC( 4 )  =  -  2.223 
AIC(5)  =  -  2.123 


so  thac  k*  =  3,  as  desired. 


The  estimated  model  is 


y(t)  =  1.9456  y(t-l)  -  1.1485  y(t-2)  +  0.0483  y(t-3) 


Note  that,  with  the  sparse  amount  of  data  available,  the  coefficient 


errors  are  relatively  large  and  the  two  estimated  models  are  relatively 
close  to  each  other. 


The  AAIC  criterion  was  used  to  test  for  a  change  in  the  time  series 
coefficients.  Since  we  have  only  one  output,  the  criterion  is 


AAIC  =  log 


S(k*)  +  S(k* )  -  S(k*) 

S^k^)  S(k*) 


2  kl* 
dl 


(6.  i: 


Using  k*  =  2,  kL*  =  3,  S(k*)  =  .0940,  9i(k!*)  =  .0726,  S(k*)  =  .0903, 
di  =  20  yields, 


AAIC  =  0.2583  -  0.0394  -  0.3  =  -  0.0811 
so  that  a  "no  change"  decision  is  made,  but  just  barely.  Note  that  the 
actual  covariance  on  the  second  interval  using  Model  #1  Is  actually  less 
than  for  the  first  interval,  as  a  result  of  using  only  a  small  testing 
interval. 


We  next  tried  the  test  over  larger  intervals,  keeping  a  4:1  ratio  bet¬ 


ween  the  historical  interval  and  the  testing  interval.  The  intervals  used 
were  160  for  the  historical  interval  and  40  for  the  testing  interval. 


The  covariance  matrices  for  the  historical  interval  were 


S 


ppl 


S 


pfl 


S 


ffl 


6.4177 

6.3504 

6.2021 

6.0182 

5.8335  | 

6.3504 

6.4177 

6.3504 

6.2022 

6.0183 

4 

6.2021 

6.3504 

6.4170 

6.3494 

6.2006 

6.0182 

6.2022 

6.3494 

6.4153 

6.3470 

5.8335 

6.0183 

6.2006 

6.3470 

6.4119 

d 

• 

6.3504 

6.2014 

6.0149 

5.8244 

\ 

5.6495 

6.2013 

6.0147 

5.8242 

5.6493 

5.4869 

=  «< 

6.0169 

5.8282 

5.6544 

5.4919 

5.3310 

5.8316 

5.6606 

5.4999 

5.3387 

5.1657 

5.6655 

1 

5.5089 

5.3502 

5.1770 

4.9871 

6.4187 

6.3529 

6.2057 

6.0202 

5.8294 

6.3529 

6.4251 

6.3642 

6.2194 

6.0332 

—  «, 

6.2057 

6.3642 

6 . 4436 

6.3856 

6.2387 

6.0202 

6.2194 

6.3856 

6.4671 

6.4058 

5.8294 

6.0332 

6.2387 

6.4058 

6.4835 

The  results  of  the  SVD  were 


U 


1 


0.6995 
0.3997 
-  0.3573 
0.3481 
0.3197 


0.4349 

-0.7769 

-0.0762 

0.3521 

-0.2785 


-0.2706 

0.1851 

0.5794 

0.3443 

-0.6620 


0.4797 
0.3162 
-0.3109 
-0 . 4460 
-0.6118 


0.1353 
-0.3202 
0.6589  ► 
-0.6613 
0.0879 


D 


5.3574 

0 

0 

0 

0 


0 

0.1611 

0 

0 

0 


0 

0 

0.1026 

0 

0 


0 

0 

0 

0.0233 

0 


0 

0 

0 

0 

0.0070 
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V 


0.4693 

-0.8139 

0.1915 

-0.2626 

0.1082 

0.4616 

-0.1011 

-0.5037 

0.6123 

-0.3847 

=  4 

0.4489 

0.3182 

-0.4470 

-0.2359 

0.6647 

0.4342 

0.3743 

0.1072 

-0.5518 

-0.5962 

0.4204 

0.2933 

0.7059 

0.4426 

0.2075 

The 

values  of 

AIC  (k) 

were 

AIC(l)  =  -  2.3072 

AIC(2)  =  -  2.4871  — - 

AIC(3)  =  -  2.4795 
AIC(4)  =  -  2.467 
AIC(5)  =  -  2.4545 

so  that  k*  =  2  is  selected.  The  estimated  model  is 


y(t)  =  1.6777  y(t-l)  -  0.7178  y(t-2) 


which  is  much  closer  to  the  actual  model  (6.11),  due  to  the  increased  data  length 

The  model  over  the  testing  interval  was  next  found.  The  covariance  matri¬ 
ces  were 


PP2 


Pf2 


20.6940 

20.2655 

19.2378 

17.6822 

15.7286 

20.2655 

20.4909 

20.0699 

19.0219 

17.4532 

19.2378 

20.0699 

20.3055 

19.8666 

18.8081 

17.6822 

19.0219 

19.8666 

20.0831 

19.6334 

15.7286 

17.4532 

18.8081 

19.6334 

19.8397 

20.4906 

19.4580 

17.9266 

15.9897 

13.7888 


19.6758 

18.1453 

16.2349 

14.0540 

11.7306 


18.3328 

16.4322 

14.2789 

11.9786 

9.6332 


16.5805 

14.4431 

12.1690 

9.8473 

7.5625 


14.5504 

12.2933 

9.9935 

7.7296 

5.5657 


s. 


f  o 


'20.9398  20.7220  19.S651  18.^736  16.6762' 

20.7220  21.1467  20.3735  19.9608  18.5264 

*  19.8651  20.8735  21.2276  20.S927  19.9440  ’ 

18.4736  19.9608  20.8927  21.1353  20.8230 

\  16.6762  18.5264  19.9440  20.8230  21.0965 


The  SVD  yielded 


0.8905 

0.4123 

0.0935 

-0 . 1344 

-0.1008 

0.3705 

-0.4959 

-0.6851 

0.2227 

0.3128 

4 

0.2259 

-0.4936 

0.5293 

0.5778 

-0.3023 

0.1280 

-0.4076 

0.4613 

-0.5170 

0.5808 

0.0473 

» 

-0.4175 

-0.1701 

-0.5755 

-0.6807 

f 

9.7032 

0 

0 

0 

0 

0 

1.7256 

0 

0 

0 

0 

0 

0.0533 

0 

0 

=  < 

0 

0 

0 

0.0158 

0 

0 

0 

0 

0 

0.0117 

0.4557 

-0.6682 

0.5738 

-0.1019 

0.0734 

0.4670 

-0.2754 

-0.5643 

0.6055 

-0.1447 

=  < 

0.4617 

0.0760 

-0.4483 

-0.6411 

0.4112 

0.4410 

0.3630 

0.1371 

-0.2325 

-0.7752 

0.4081 

0.5832 

0.3640 

0.3974 

0.4504 

The  resulting  values  of  AIC(k)  were 


AIC(l)  =  0.3771 
AIC(2)  =  -  2.7253 
AIC(3)  =  -  2.7595 
AIC(4)  -  -  2.6753 
AIC(5)  =  -  2.6253 


so 


that  ki*  =  3  was  selected.  The  resulting  estimated  model  was 
y(t)  =  2.3931  y(t-l)  -  1.977  y(t-2)  +  0.7421  y(t-3) 
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which  is  much  closer  to  the  actual  model  (6.12)  than  the  estimate  based  on 
half  as  many  data  points. 

The  AAIC  criterion  was  then  applied  to  test  for  a  model  change.  Using 
(6.13)  with  k*  =  2,  ki*  =  3,  S(k*)  =  0.0811,  S^k^)  =  0.0564, 

S(k*)  =  0.1119,  we  get 

AAIC  =  0.3632  +  0.3801  -  0.15  =  0.5933  (6.14) 

which  yields  a  "change"  decision.  By  comparing  (6.14)  to  (6.13)  we  note 
several  things.  The  first  term,  which  is  log  S(k*)  -  log  Si(k^*)  now  more 
strongly  indicates  a  change,  due  to  better  model  fit.  The  second  term, 
which  is  the  effect  of  modeling  error  on  the  measured  error  covariances, 
also  more  strongly  indicates  a  change  due  to  increased  data  length,  which 
produces  a  more  accurate  estimate  of  the  true  error  covariance  during  the 
testing  interval,  using  the  “no  change"  hypothesis.  Finally,  the  last  term 
more  strongly  indicates  a  change,  since  the  bias  for  a  "no  change"  decision 
Is  reduced  due  to  increased  data  length.  Thus  we  see  that  all  three  terms 
in  the  AAIC  criterion  contribute  to  the  final  decision,  and  each  one  is  of 
importance  in  achieving  an  accurate  decision. 
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CHAPTER  7 


OPTMAL  ADAPTIVE  IDENTIFICATION  OF  CHANGING  SYSTEMS 

7.1.  Introduction 

A  fundamental  problem  in  tracking  a  target  performing  evasive  manuvers  is 
adaptation  to  changes  in  the  dynamical  characteristics  of  the  target  motion.  In  previous 
approaches  to  target  modeling,  simplistic  models  have  largely  been  used  which  do  not  take 
into  account  the  changing  characteristics  of  the  target  motion  as  different  manuvers  are 
performed  by  the  target.  To  improve  upon  these  methods  requires  the  development  of  new 
methods  that  are  able  to  adapt  to  the  changes  in  target  motion  characteristics  which  may 
be  either  slowly  varying  or  abrupt. 

Previous  approaches  to  adaptive  identification  and  detection  of  abrupt  changes  in 
systems  have  had  a  number  of  deficiencies.  Adaptive  tracking  of  slow  changes  has  been 
largely  adhoc  and  not  based  upon  a  sound  statistical  theory.  Much  of  this  work  has  been 
done  in  the  context  of  recursive  identification  using  exponential  weighting.  This  is  of 
course  very  attractive  from  a  computational  point  of  view.  For  detecting  abrupt  changes 
the  literature  of  failure  detection  is  applicable  primarily  to  the  comparison  of  a  limited 
number  of  specific  simple  hypotheses  involving  jumps  in  the  states  or  simple  actuator  or 
sensor  failures.  This  does  not  include  the  case  of  a  dynamical  system  where  abrupt  changes 
can  occur  in  the  characteristics  of  the  dynamics.  The  difficulty  is  the  vast  number  of 
possible  changes  in  dynamical  structure  and  order  that  can  occur.  A  further  problem  is 
that  the  change  can  occur  at  an  arbitrary  time  which  requires  the  comparison  of  a 
multitude  of  nonnested  hypotheses  which  is  not  delt  with  by  classical  hypothesis  testing 
theory. 
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The  objective  of  this  paper  is  to  discuss  these  current  unsolved  problems  in  adaptive 
systems  and  to  propose  a  new  approach  which  is  believed  to  solve  the  problem  of 
adaptation  for  changing  systems  in  a  fundamental  way.  These  current  problems  have  been 
discussed  to  some  extent  in  the  recent  dissertation  of  Hagglund  (1983),  with  some 
approaches  proposed  in  the  context  of  recursive  prediction  error  identification.  In  that 
work,  a  number  of  particular  adaptive  identification  and  detection  problems  were  defined, 
and  the  desired  properties  of  the  solution  were  discussed.  The  proposed  solutions  were 
shown  to  work  on  simple  low  order  systems.  However,  a  number  of  problems  were  not 
addressed  that  occur  in  higher  order  and  multivariable  systems.  These  difficulties  in 
include  the  lack  of  invariance  of  the  procedures  in  general,  the  nonexistence  of  a  global,  and 
the  lack  of  a  procedure  of  the  determination  of  an  appropriate  model  state  order.  These 
are  indeed  very  difficult  problems  some  of  which  have  not  been  completely  solved  even  for 
the  offline  case. 

In  this  paper  we  share  much  of  the  intent  stated  in  Hagglund,  but  take  a  much  more 
general  approach  to  solving  these  problems.  The  Kullback—Leibler  information  (1959)  or 
entropy  measure  of  model  approximation  error  has  recently  been  shown  to  follow  naturally 
from  the  fundamental  statistical  principles  of  sufficiency  and  repeated  sampling  in  a 
predictive  inference  context  (Larimore,  1983).  The  Akaike  information  criterion  (Akaike, 
1973)  is  an  unbiased  estimate  of  the  entropy  measure  which  is  optimal  for  large  samples 
(Shibata,  1981).  In  this  paper,  the  entropy  and  predictive  inference  approach  is  applied  to 
the  problems  of  adaptive  tracking  of  slowly  changing  processes  and  abrupt  changes.  This 
requires  the  extension  of  the  AIC  procedure  to  the  case  of  comparing  different  models  over 
different  data  intervals  whereas  the  AIC  procedure  was  originally  developed  for 
comparisons  of  different  models  on  the  same  data  interval.  The  concepts  and  notation 
developed  in  Larimore  (1983,  1985)  is  used  in  the  development. 
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The  structure  of  the  paper  is  as  follows.  The  approach  to  the  general  problem  of 
adaptation  is  discussed  in  the  next  section.  To  generalize  the  AIC  procedure,  some 
properties  of  maximum  likelihood  estimators  are  derived  for  nonnested  classes  of  models  in 
section  7.3  Section  7.4  gives  a  generalization  of  the  AIC  estimate  of  negentropy.  Section 
7.5  discusses  the  use  of  this  generalized  AIC  in  adaptative  tracking  while  section  7.6 
describes  its  use  in  abrupt  change  detection. 

7.2.  Approach  to  Adaptation 

The  use  of  nredictive  inference  and  entropy  concepts  and  methods  is  the  basic 
approach  taken  in  adaptation  to  system  changes.  To  formulate  the  problem,  consider  a 
division  of  a  time  span  into  a  set  of  disjoint  intervals  whose  union  is  the  whole  time  span. 
With  each  time  interval  we  associate  a  model  for  the  dynamical  system  which  is 
determined  from  the  observed  data  using  a  particular  model  selection  method.  In  this 
approach,  different  divisions  of  the  time  span  into  time  intervals  are  considered  as  well  as 
different  model  selection  procedures.  This  allows  the  consideration  of  very  general  models 
that  include  slow  as  well  as  abrupt  changes  as  described  below.  The  details  of  this  are 
given  in  Sections  7.5  and  7.6,  with  a  brief  overview  of  the  concepts  given  here. 

Consider  first  the  case  of  tracking  slow  changes.  Suppose  a  given  time  span  is 
divided  into  a  set  of  time  intervals.  Several  different  intervals  sets  will  be  considered 
involving  divisions  of  the  time  span  using  different  interval  lengths.  On  each  time  interval 
of  each  interval  set,  a  best  model  will  be  determined  by  choosing  a  model  from  a  class  of 
models  using  say  the  AIC  procedure.  The  class  of  models  considered  can  include  different 
state  orders  and  other  structural  characteristics.  For  each  set  of  intervals,  a  composite 
model  consists  of  the  models  associated  with  the  various  intervals  of  the  interval  set. 
Composite  models  for  several  interval  sets  can  be  compared  to  determine  which  division  of 


the  time  span  into  intervals  leads  to  the  bet  composite  model.  If  the  intervals  are  chosen 
to  be  too  long,  then  changes  in  the  process  during  the  intervals  will  lead  to  increased 
prediction  error.  On  the  other  hand  if  the  time  intervals  are  too  short,  then  variability  in 
observations  and  the  use  of  an  increased  number  of  total  parameters  in  the  composite 
model  will  also  increase  the  prediction  error.  As  a  result,  there  will  be  an  optimal  choice  of 
division  of  the  time  span  into  time  intervals.  In  practice  it  is  not  necessary  to  obtain  the 
optimum  division  but  only  a  good  approximation  which  will  determine  an  approximately 
optimal  model  update  rate  for  re— identifying  the  model  dynamics  and  noise  process 
characteristics. 

Now  consider  the  case  of  abrupt  change  detection,  suppose  that  the  above  tracking 
of  slowly  changing  process  characteristics  is  done  and  the  optimal  slowly  changing  model  is 
identified.  Then  suppose  that  an  abrupt  change  occurs  at  an  unspecified  time  following  the 
end  of  the  time  span  used  in  tracking  the  slow  changes  but  within  the  update  rate  used  in 
slow  tracking.  We  consider  the  new  time  span  that  includes  the  failure  and  consider 
divisions  of  it  into  intervals.  On  each  interval,  a  model  is  fitted  using  say  AIC  to 
determine  a  model  from  a  class  of  models.  Now  the  principle  question  is  whether  the  fitted 
models  on  the  various  intervals  are  significantly  different  from  the  last  model  chosen  by  the 
slowly  changing  adaptive  procedure. 

To  actually  make  the  proposed  comparisons  for  adaptive  tracking  and  abrupt 
change  detection  requires  the  development  of  new  results  in  predictive  inference  since 
different  models  are  being  compared  across  different  time  intervals,  and  non— nested 
multiple  comparisons  are  also  involved.  Neither  of  these  cases  is  considered  in  the  previous 
theory  of  AIC  and  related  predictive  measures.  This  theory  is  developed  in  the  following 
sections. 
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7.3.  Constrained  Maximum  Likelihood  Estimation 

In  this  section,  properties  of  the  maximum  likelihood  parameter  estimates  are 
developed  for  the  case  that  the  true  probability  model  is  not  contained  in  the  class  of 
parameterized  densities  that  are  considered  for  inference.  The  classical  development  of  the 
asymptotic  consistency  and  minimum  variance  of  maximum  likelihood  estimators  is  for  the 
case  where  the  true  density  id  contained  in  the  parametric  class. 

The  predictive  inference  framework  as  in  Larimore  (1985)  is  adopted  here  with 

p(q,0)  the  parameterized  probability  density  where  0  is  a  vector  of  parameters,  q  is  the 

informative  sample  and  r  is  the  predictive  sample.  Suppose  that  the  parameter  vector 
T 

0  ={0y0y.)  is  a  finite  or  infinite  set  of  parameters,  and  for  each  subset  of  distinct 
positive  integers  k=(kp...,km)  consider  the  subspace©  ^  of  0  such  that  only  the 
corresponding  0^  , 

class  of  models  Ck={p(q,^),60j{},  These  classes  of  models  are  in  general  nonnested  so 
that  we  do  not  in  general  have  C^cCj  or  CjCC^-  The  maximum  likelihood  estimator  for  the 
class  will  be  denoted  as  Oq). 

The  development  of  the  maximum  likelihood  theory  is  straight  forward  for  the  case 
where  Taylor  series  expansions  are  possible.  This  holds  under  the  following  regularity 
conditions  (Cox  &  Hinkley,  p.  281): 

(i)  The  parameter  space  is  closed  and  compact. 

(iij  The  probability  distributions  defined  by  any  two  different  values  of  0  are  distinct. 


are  nonzero  where  ft  denotes  a  member  of 0  and  let  be  the 


7-5 


(iii)  The  first  three  derivatives  of  the  log  likelihood  lq,0)  with  respect  to  0  exists  in  the 
neighborhood  of  the  true  parameter  value  almost  surely.  Further,  in  such  a  neighborhood, 
n  *  times  the  absolute  value  of  the  third  derivative  is  bounded  above  by  a  function  of  q, 
whose  expectation  exists. 


In  particular,  these  conditions  permit  the  interchange  of  expectation  and  differentiation  up 
to  second  order. 


In  the  discussion  various  order  models  are  considered,  and  the  relationships  between 
the  various  orders  is  developed.  The  log  likelihood  function  of  the  informative  sample  q 
will  be  denoted  by  f(q,0),  and  the  gradient  row  vector  and  Hessian  matrix  denoted  t{q,0) 
and  f"(q ,0)  respectively.  Expectation,  denoted  E,  will  be  with  respect  to  the  true  density 
p(q ,0)  unless  stated  otherwise  where  0  onto  0^  as  the  parameters 0  ^60^  minimizing  the 
negentropy  Rq  relative  to  the  informative  sample  q 

Rq«/)  =  E<(q/)  (7.1) 

At  the  minimum  a,  the  gradient  of  (3—2)  is  zero  so  from  the  regularity  conditions 

Ef(q,flk)  =  0, 


and  the  minimum  is  unique  if  and  only  if  the  expected  Hessian,  denoted  D^=Ef,'(q,^t). 
Expanding  (7—1)  in  a  Taylor  series  gives  a  second  order  expression  for  the  information 
distance  which  holds  asymptotically  for  large  sample  size  of  the  informative  sample 


Rq(fl/)  =  E[f(^)  -  f(^)]  +  E [t{0)  - 
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=  -  E^X^  -  ?)]  -  EK^  -  -  ~h)  +  E/(0)  -  ift)) 

2 

2 

=  -511  (M'iidJ  +  w).h 

To  determine  the  moments  of  the  maximum  likelihood  estimates  0^ ,  consider  the 
first  order  equality 

0  =  t\ q,^)  =  ^1(q,^k-?c)Tr(q)^k) 

Taking  expectation  with  respect  to  the  true  density  and  using  (7—2)  gives  the  equation 

Dk(Eik-0k)  =  0 

that  holds  asymptotically  for  large  informative  sample  N.  For  identifiable,  i.e. 

It 

unique,  D^is  nonsingular  which  implies  that  to  first  order 

Now  using  (7-4),  the  covariance  of  the  estimation  error  is 

E(0k-0k)(0k-0k)T  =  (Dk)_1E{flT(q,^k)£1(q,^k)}(Dkr1 
Note  that  in  the  unconstrained  case,  the  middle  term  which  is  the  Fisher  information 

It 

matrix  is  equal  to  minus  the  expected  Hessian  D  ,  but  this  is  not  in  general  true  for  the 
constrained  case. 


7.4  Estimation  of  Entropy 


For  decision  on  model  parametric  order  and  structure,  it  it  necessary  to  estimate  the 
negative  entropy  based  on  the  sample.  One  such  procedure  is  due  to  Akaike  (1973.  We 
consider  the  case  where  the  informative  sample  q  and  the  predictive  sample  r  are 
independent.  For  each  selection  of  a  parameter  subset  k  =(kp....,k  ),  the  Akaike 
information  criterion  for  comparing  the  maximum  likelihood  estimators  is 

AIC(k)  =  — 21ogp(q)0k(q))  +  2K(k) 

where  K  (k)  is  the  number  of  parameters,  i.e.  the  dimension  on  (fc.  The  minimum  AIC 

estimator  (MAICE),  denoted  0^(q),  is  0^(q)=^q)(q)  where  k(q)  is  the  parameter  set 
minimizing  AIC  (k)  is  an  unbiased  estimator  of  the  negative  entropy  (7-1)  based  upon  the 
informative  sample  and  the  assumed  model  structure.  The  predictive  sample  is  essentially 
replaced  by  the  information  sample,  and  the  term  2K(k)  is  an  adjustment  for  the  bias  due 
to  the  correlation  between  the  informative  sample  q  and  the  estimate  ^(q). 

Following  Akaike,  we  use  the  maximized  log  likelihood  t  (v  )  =  ^(q>v  (q))  as  an 
estimate  of  the  relative  entropy  and  compute  the  bias  in  the  procedure.  Consider  the 
expected  log  likelihood  difference  using  (7—3) 

E [1(0)  -  l(^) }  =  E \i(^)  -  l{0 k)]  +  E [t{0)  -  i(^)) 

=  -  E [/’(0k)(0k  -  ?)]  -  E[i(^  -  0k)'I7"(Sk)(0k  -  ’^)]  +  E [t(0)  -  f(?)] 

E[(0k  -  0k)Tf,(0k)((i?k  -  ^  -  ?)]  + 
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=  -t'Idim(«k)  +  R(’».^)  =  -dim  (0*)  +  K(0,?) 


where  the  third  equality  follows  using  Equation  (7—4)  which  is  satisfied  by  the  maximum 
likelihood  estimate,  using  the  asymptotic  equivalence  of  the  negative  Fisher  information 
and  the  Hessian  in  (7—7),  and  the  expression  (7—3)  for  the  negentropy.  Consider  the  case 
of  fitting  two  models  0^  and  0^,  and  consider  the  expected  difference  of  the  maximized  log 
likelihoods 


E[40)  -  l(^)\  =  Eft?)  -  l(&)\  -  E [l(0)  -  l(h\ 

=  +  dim(?)  -  dim(^)  +  R(0,^)  -  R(0,?) 


Thus  for  relative  comparisons  among  hypotheses  based  on  a  given  sample,  an  unbiased 
estimate  of  twice  the  negentropy  E [1(d)  —  t{^)]  is  given  by  the  Akaike  information 
criterion.  Note  that  the  proof  of  this  is  much  more  general  than  that  originally  given  by 
Akaike  (1973)  since  it  applies  to  the  general  case  of  comparisons  of  nonnested  structures. 
Also,  the  true  parameter  9  need  not  be  contained  in  the  structures  being  compared  so  long 
as  the  Fisher  information  matrix  is  a  constant  in  a  neighborhood  including  the  true 
parameter  and  its  projection  onto  the  subspaces  of  these  structures. 

7.5  Adaptation  to  Slow  Changes 

The  key  issue  in  adaptation  to  slow  changes  is  the  choice  of  the  rate  of  adaptation. 
Previous  approaches  to  slow  adaptation  have  been  largely  heuristic  and  not  based  upon 
sound  statistical  principles.  In  this  section  the  principles  and  concepts  of  predictive 
inference  are  used  to  derive  a  procedure  for  choosing  an  adaptation  rate  which  minimizes 
the  prediction  error  for  an  independent  sample  from  the  process. 


7-9 


Consider  the  problem  of  choosing  the  optimal  rate  at  which  to  identify  the  system, 
we  consider  the  problem  as  stated  in  Section  7.2  where  it  is  assumed  that  the  system  is 
slowly  changing,  and  that  based  upon  the  observed  data  we  wish  to  determine  a  best  rate 

I 

to  identify  the  system.  Consider  dividing  the  data  over  an  interval  of  L  =2  samples, 
where  ^is  an  integer,  into  H  =2^  subintervals  each  with  length  Suppose  for  a  given  h 

that  on  each  of  the  intervals  L  for  1,2,3,. ..2^  the  best  state  space  model  M^  j  is  determined 
using  the  minimum  AIC  estimate  (MAICE)  criterion, 


To  provide  some  motivation,  we  consider  the  negentropy  as  expressed  in  (7—3).  For 
the  case  of  a  constant  parameter  model,  consider  the  effect  of  the  number  of  the  parameters 
and  the  bias  in  choosing  too  low  an  order  model.  The  negative  entropy  depends  upon  the 
particular  parameter  estimation  procedure,  but  for  large  samples  it  is  bounded  from  below 
by 


2^ 

ER(0,^)  =  ^ll^— ^llnk  +  ER (oft)>  k(k)  +  E  ER  (0,^) 

2  U  Z  j=l  ’ 

where  j  is  a  interval  index.  This  lower  bound  is  accurate  for  0^  near  $  The  first  term 
is  the  sampling  variability  of  the  optimal  estimator.  The  second  term  is  the  bias  due  to 
constraining  the  parameter  estimates  0^  to  lie  in  the  subspace  which  increases  with 
increasing  sample  size  since  the  parameters  of  the  time  varying  process  are  not  constant. 


On  the  interval  L  we  consider  the  model  selection  procedures  for  h  =  1,...H  as 

above,  i.e.,  procedure  fits  the  composite  model  consisting  of  the  tin.e  invariant  models 

Mi  for  j  =1,2,3,.. .2^  fitted  on  each  of  the  2^  subintervals.  For  a  given  h  the  model 
“ij 

selection  procedure  is  identical  to  the  maximum  likelihood  problem  of  estimating  the 
parameters  0^  j  of  the  joint  models  j  for  j  =1,...,2*\  Then  for  a  given  h  the  MAICE  for 
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the  model  selection  procedure  is 

MAICE(Mh)  =  -2  log  p  (IL,0h)  +  2Kh(kh) 

2h 

=  £  [-2  log  p  (I  jt0h 

j=l 


*  _ ^ 

We  wish  to  choose  the  data  length  D  =L2  11  corresponding  to  the  h  minimizing 
MAICE(M,  ).  This  procedure  gives  a  very  sensitive  comparison  of  different  rates  2~^  for 
identifying  the  model,  or  equivalently  the  data  length  D  =2l  n  for  identifying  a  model. 

The  tradeoff  between  sampling  variability  form  using  too  small  a  data  length  and  bias  from 
using  too  long  a  data  length  is  seen  by  the  effect  of  data  length  on  the  two  terms  in  the 
MAICE  criterion.  Too  little  data  introduces  variability  in  the  log  likelihood  function  and  a 
penalty  for  more  parameters,  while  too  much  data  reduces  these  but  introduces  bias  in  the 
model  due  to  modeling  a  changing  process  as  one  with  constant  parameters.  The  optimum 
is  achieved  when  the  respective  rates  of  decrease  and  increase  are  balanced. 


7.6  Detecting  Model  Changes  Across  Different  Data  Sets 

In  Section  7.4  the  AIC  was  shown  to  give  an  unbiased  estimate  for  choosing  a  model 
structure  for  a  give  set  of  data.  In  this  section,  we  consider  the  problem  of  determining  if 
there  is  a  change  in  the  process  between  tow  different  data  sets.  The  detection  problem 
that  we  consider  is  where  the  process  is  modeled  as  a  slowly  changing  process  using  some 
efficient  procedure  such  as  given  in  Section  7.5.  Suppose  that  from  such  a  procedure  a 
model  is  given  for  an  interval  of  data  and  that  over  a  later  interval  of  data  set  Q2 
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second  model  is  fitted.  The  data  lengths  of  the  two  sets  are  generally  different  with  the 
second  set  usually  much  shorter.  In  the  context  of  the  scheme  in  Section  5  for  fitting 
optimal  fixed  parameter  models  over  many  intervals  of  various  lengths,  the  two  intervals 
we  wish  to  determine  if  there  has  been  a  significant  departure  in  the  process  characteristics 
between  the  two  data  sets. 

Since  the  model,  denoted  M^,  fitted  on  data  set  Q ^involves  a  near  optimal  selection 
of  the  data  length,  the  model  provides  a  best  prior  model  when  Q ^  is  chosen  as  the 
most  recent  optimal  length  interval  preceding  C^.  To  detect  any  abrupt  changes  in  the 
system,  we  wish  to  compare  on  the  joint  data  set  (QpQ2)  the  fit  of  the  composite  model 
(MpMg)  with  the  composite  model  taking  into  account  the  fact  that  the  models 

and  M2  are  fitted  over  the  respective  intervals  and  To  make  this  comparison, 
we  seek  an  unbiased  estimate  of  the  difference  between  the  negentropies  of  these  two 
composite  models. 

The  Markov  stucture  can  be  used  to  make  the  two  samples  essentially  independent 
by  conditioning  the  observations  on  the  past.  The  joint  distribution  of  the  two  data 
intervals  is 


p(QpQ2)  -  p(Q1)p(Q2I  Qj) 

Thus  we  suppose  that  the  models  on  each  of  the  two  sets  are  fitted  to  the 
conditional  data  given  the  past.  Since  the  model  on  the  first  data  set  is  the  same  for 
both  composite  hypotheses,  the  first  term  p(Q^)  is  the  same  for  both  composite  models. 
Thus  we  need  only  compare  the  negentropy  difference  between  the  conditional  model 
M^Qgl  Q^)  and  the  conditional  model  M2(Q2|  Q^)  both  fitted  on  the  interval  Q ^ 
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Denoting  the  parameter  estimates  of  the  models  and  Q0  by  0  and  O' 
respectively,  we  compare  the  log  likelihoods  on  the  random  variables  Q0  given  Q^.  The 
expected  difference  of  the  log  likelihoods  of  the  two  models  is 

-  [(e2)}  =  E [i(0)  -  t{02)\  -  E [1(0)  -  l(01)} 

=  -dim(^)  +  R (oft)  -  R (0,0l) 

where  the  term  dim (0  )  is  not  present  since  the  estimate  0  is  a  function  only  of  the  sample 
which  is  independent  of  the  sample  C^j  Q^.  Thus  an  unbiased  estimate  of  the 
difference  of  negentropies  R (0,(P)  of  the  two  models  is 

1(f)1) +  dim(^) 

This  gives  a  test  for  the  occurrence  of  an  abrupt  change  between  the  two  data 
intervals.  Depending  upon  the  nature  of  the  change  and  the  process  characteristics,  the  bet 
detection  interval  will  vary.  Some  changes  give  most  of  the  information  about  the  change 
over  a  short  interval  while  others  have  a  cumulative  effect  and  require  a  long  time  interval 
to  detect. 

7.7  Detection  of  Slow  Model  Changes 

We  now  consider  the  detection  of  slow,  essentially  continuous,  model  changes. 

What  we  wish  to  achieve  in  this  case  is  an  appropriate  data  length  over  which  to  fit 
models.  If  the  data  length  is  too  short,  then  there  will  be  a  tendency  to  over— fit  the  model 
to  the  noisy  data,  leading  to  larger  prediction  errors.  If  the  data  length  is  too  long,  then 
the  effects  of  parameter  variations  will  begin  to  dominate  the  prediction  errors. 
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In  order  to  generate  an  appropriate  measure  by  which  to  trade  off  these  two 
characteristics,  we  again  use  the  AIC  criterion,  but  in  a  different  way.  Assume  we  have 
data  over  a  time  interval  I  =  {1,2,. ...n}  and  suppose  we  divide  this  interval  into 
subintervals  of  length  W: 


IpW . W} 

I2  =  {W+l,  W+2,  ...,  2W}, 


etc. 

Then  an  appropriate  measure  for  data  length  determination  is  the  average  per  sample 
entropy.  In  terms  of  the  AIC  criterion  we  define 

* 

ATUW  =  _i_  E  i  AIC  (kj) 
w  Nw  i  W  P  1 

where  AIC  (kj*)  is  the  minimum  prediction  AIC  for  the  i  interval  L,  kj*  is  the  optimal 
model  order  for  the  t  interval,  and  Nw  is  the  number  of  intervals  of  W  over  the  whole 
data  interval  I.  The  prediction  AIC  uses  the  forward  prediction  error  variance  (cf  eq.  5.9)) 
rather  than  the  fit  error  variance,  since  we  are  interested  in  the  error  over  the  next 
interval,  not  the  one  over  which  the  model  was  fitted.  This  has  the  effect  of  increasing  the 
penalty  on  the  number  of  parameters  in  the  AIC  criterion. 

The  form  of  AIC  is 
P 

AIC^*)  =  AIC(ki*)  +  M(kj*) 
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Experimental  Results 

In  order  to  test  this  criterion  as  the  basis  for  data  length  selection,  we  considered  a 
second-order  AR  model 

y(t)  =  ax(t)  y(t-l)  +  a2(t)  y(t-2)  4-  n(t) 

where  n(t)  was  zero— mean  white  gaussian  noise  with  unit  variance.  The  time  varying 
coefficients  were  selected  so  that  the  two  system  poles  were  on  the  unit  circle  in  the 
z-plane.  This  yields:  a,(t)  =  2  cos  0( t),  a2  =  —1  where  the  roots  are:  cos  0( t)  -  i  sin  0(t) 
and  cos  0(t)  —  i  sin  0(t).  The  time— variation  of  0( t)  was  selected  as  0(t)  =  0(0)  +  2  r  f  t, 
where  f  is  a  selected  frequency.  Two  values  of  f  (.0001,  .001)  were  used  in  the  experiments. 
Total  data  length  was  1000  time  points.  The  results  for  Case  1  (f  =  0.001)  are  shown  in 
Table  7.1,  using  0(0)  =  0.2.  The  result  is  that  the  optimal  indicated  data  length  is  10-12 
samples  and  corresponds  to  the  case  in  which  the  average  coefficient  change  over  the  fit 
window  is  in  the  range  of  0.80  —  0.96.  Over  the  entire  data  length  of  1000  samples,  the 
value  of  a^  starts  at  1.64,  deceases  to  1.90  at  t  =  400  and  then  increases  to  1.90  at  t  = 

1000.  Thus,  the  average  coefficient  change  over  the  optimum  data  length  is  generally  more 
than  40%  of  the  coefficient  value. 

Table  2  shows  the  results  for  Case  2  (f  =  0.0001)  in  which  the  optimal  data  length  is 
found  to  be  30  samples.  This  is,  of  course,  increased  over  that  of  Case  1  since  the 
coefficients  vary  much  less  rapidly  —  on  the  order  of  0.019,  on  the  average.  Note  that  the 
rms  prediction  error  aQ  evaluated  over  the  fit  set  generally  decreases  monotonically  with 
data  length  and  cannot  be  used  as  a  selection  criterion. 
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CHARTER  8 


E— M  Algorithm  for  Adaptive  Time  Series  Analysis 

In  this  chapter,  we  explain  the  basic  properties  of  the  E-M  Algoritnm  based  on 
Dempster,  Laird  and  Rubin  (1977)  and  describe  its  application  to  Adaptive  Time  Series 
Analysis  The  E— M  Algorithm  involving  iteration  of  E  (Estimation)  and  M 
(Maximization)  steps  is  a  general  procedure  for  maximum  likelihood  estimation  (MLE)  of 
models  from  incomplete  data.  We  show  that  CVA— Regression  algorithm  of  previous 
chapters  implements  E  and  M  steps  in  a  special  way.  Based  on  the  recognition  of  this  fact, 
we  show  how  the  CVA— Regression  approach  can  be  generalized  to  obtain  MLE  of  the  state 
space  model.  In  addition,  we  show  how  the  algorithm  can  be  implemented  recursively  to 
allow  for  real-time  identification  and  extension  to  time— varying  systems.  We  also  consider 
extensions  to  missing  data,  nongauss  iamstatistics  and  ARM  A  models. 

8.1  E— M  algorithm  —  Basic  Properties: 

The  basic  motivation  for  the  E— M  algorithm  comes  from  the  fact  that  in  numerous 
estimation  problems,  only  partial  observations  of  all  the  states  or  underlying  causal  factors 
are  available.  If  the  observations  of  the  complete  state  or  factors  were  available,  the 
estimation  problem  would  become  simple.  The  E— M  algorithm  estimates  the  complete 
state  given  the  observed  data  and  then  estimates  parameters  after  construction  of  the 
complete  data  set.  This  process  is  repeated  in  such  a  way  that  the  likelihood  function 
increases  with  each  iteration  of  the  E— M  algorithm,  till  convergence  to  a  local  or  global 
maximum  of  the  likelihood  function  is  achieved. 

The  simplest  example  of  the  application  of  the  E— M  algorithm  is  for  the  case  of 
missing  data.  The  E— step  involves  estimating  the  missing  data  points  and  the  M  step  is 
based  on  the  procedure  used  when  there  is  no  missing  data.  Regression  models  with 
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missing  data  points  can  be  estimated  in  this  fashion. 

The  Markovian  modeling  approach  to  Time  Series  Analysis  lends  itself  to  the  E— M 
approach.  In  state  space  models  for  Markov  processes,  all  the  states  are  generally  not 
observed.  It  is  well  known  that  if  all  the  states  are  observed,  the  system  identification 
problem  is  solved  easily  by  Regression  or  Ordinary  Least  Squares  (CLS).  For  partial  state 
observations,  estimation  of  the  full  state  using  a  Kalman  Filter  requires  knowledge  of  the 
parameters.  A  natural  approach  based  on  the  E— M  algorithm  is  to  assume  the  parameters, 
estimate  states  and  update  the  parameters  using  the  estimated  states.  Even  though  the 
concept  is  simple  and  has  been  proposed  earlier  in  the  literature  of  system  identification, 
both  the  E  and  M  steps  must  be  carried  out  properly  so  that  the  likelihood  function 
increases  at  each  iteration  and  the  parameters  converge  to  the  ML  estimates.  Therefore 
the  conditions  under  which  the  E— M  algorithm  converges  must  be  examined  carefully. 

We  state  here  the  basic  results  from  Dempster,  Laird  &  Rubin  (1977)  on  the 
application  and  convergence  of  the  E— M  algorithm.  We,  then,  apply  these  results  to  the 
problem  of  state  space  model  identification  using  CVA  and  E— M  algorithms. 

Assume  two  sample  spaces  S  and  Y  and  a  many— to— one  mapping  from  S  to  Y.  The 
observed  data  y  is  a  realization  from  Y.  The  corresponding  state  s  in  S  is  not  observed 
directly,  but  only  indirectly  through  y.  The  sampling  density  defined  for  all  s  in  S  is 
f(s|<J)),  where  <f>  denotes  a  parameter  vector.  The  corresponding  sampling  density  for  y  is  Y 
is  obtained  as 


s(y  I4>)  =  JS(y)f(s)l<t>)ds  (8.1) 

The  EM  algorithm  attempts  to  find  (J)  which  maximizes  g(y|(j>)  given  y  by  making 
an  essential  use  of  the  density  f(s|<j)). 
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One  of  the  kev  relationshio  used  in  the  derivation  of  the  E— M 


algorithm  :s 


=  LLLLH—  (5-2) 

P  l  s  i  y  ,  «t>  ) 

where  p(s|v,(J>)  denotes  conditional  distribution  of  s  given  y  and  $.  Eq.  (8.2)  follows 
from  the  fact  that 


g(y  1 4>)  P(s|y,4>)  =  p(s,y|cj>) 

and  p(s,y (4>)  =  f(s|<j>) 

since  y  is  a  subset  of  s.  Taking  logs  on  both  sides  of  Eq.  (8.2),  we  obtain  the 
log-likelihood  function 

L(4>)  =log  g(y|<j>) 

=  log  f(s  |  (j))  —  log  p(s  |  y,4>)  (8.3) 

For  the  case  of  exponential  family  of  probability  distributions,  which  includes  the 
gaussion  case,  Eq.  (8.3)  simplifies  greatly  and  leads  to  very  useful  results.  We  first  note 
that  an  exponential  distribution  can  be  written  in  the  form, 

f(s  |  <}>)  =  b(s)  exp  (<(>  t(s)T)/a(<j>)  (8.4) 

where  t(s)  is  a  1  x  r  row  vector  of  sufficient  statistics  for  the  complete  data.  The  1 
x  r  row  vector  parameterization  $  is  unique  up  to  an  arbitrary  non— singular  rxr  linear 
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transformation,  as  is  the  corresponding  choice  of  t(s).  Based  on  a  property  of  the 
exponential  distributions.  p(sjy,$)  is  also  exponential  and  has  the  form 

p(s  |  y,<j))  =  b(s)  exp  (c^t(s)T)/ a(4:(y))  (S.5) 

where  a(<}>(y))  =  /  b(s)  exp  (<|>t(s)^)ds  (8.6) 

S(y) 

The  key  property  here  is  that  both  f(s  1 4>)  and  p(s  j  y ,4>)  are  from  the  same 
exponential  family  with  the  same  natural  parameters  (j>  and  the  same  sufficient  statistics 
t(s),  but  are  defined  over  different  sample  spaces  S  and  s ( y > . 

Eq.  (8.3)  can  now  be  written  as 

L(4>)  =  -  log  a(<j>)  +  log  a(<Ky))  (8.7) 

The  first  partial  derivative  of  L((j>)  or  the  score  function  is  obtained  from  Eq.  (8.6)  &;  (8.7) 
as 


3L(<I>)  =-E(t(s)|<|>)  +  E(t(s)|y,j>)  (8.8) 

where  E(.)  denotes  expectation  over  the  appropriate  sample  spaces  viz.  S  and  S(y) 
respectively  in  Eq.  (8.8). 

T 

Since  u  =  0  is  a  necessary  condition  at  the  maximum  of  the  likelihood  function, 
the  MLE  <j>  must  satisfy  the  condition. 
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E  (t(s);$)  =  E  (i(5)|v 4) 


The  E— M  algorithm  achieves  (S.9)  iteratively  as  follows: 


E— Step:  Estimate  t(s)  during  kth  iteration  as, 


iW(s)  =  E(t(s)|y/0) 


(8.10) 


M— Step:  Obtain  updated  parameter  vector,  t^k+^(  by  solving  the  eqn. 


E(t(s)|(M  =  ;(k)(s; 


(8.11) 


The  E— M  algorithm  of  Eq.  (8.10)  and  (8.11)  defines  a  parameter  mapping  (|) 


(k) 


<t>(k+1)  as 


<j>(k+1)  =  M  (<J)W) 


(8.12) 


This  mapping  has  several  interesting  properties  which  are  given  in  Dempster,  Laird 
k  Rubin  (1977)  and  Wu  (1983).  In  particular,  it  leads  to  a  monotonic  increase  in  the 
likelihood  function  L((J>).  The  conditions  under  which  the  parameter  sequences  (j/k^ 
converges  to  the  MLE  <f>  and  the  rates  of  convergence  are  given  in  the  above  papers.  These 
conditions  are  verifiable  for  exponential  family.  The  Jacobian  of  the  mapping  (8.12)  is 
given  by  the  expression. 


=  V  (t|y,+)  V(t|+)-l 


(8.13) 


8-5 


where  V(.)  denotes  covariance  operator.  The  rate  of  convergence  is  determined  by  the 
eigenvalues  of  the  above  Jacobian  which  in  turn  depend  on  the  information  loss  due  to 
incompleteness  of  the  data. 


8.2  MLE  of  State  Space  Model  Parameters  Using  E— M  Algorithm: 

Consider  a  state  space  model  in  the  usual  notation: 

x(k  +  1)  =  F  x(k)  +  G  w  (k)  (8.14) 

y(k)  =  H  x  (k)  +  u(k)  (8.15) 


k  =  1,2, . N 


Let  (J>  denote  all  unknown  parameters  in  the  above  model.  It  is  required  to  estimate 
$  given  output  data,  =  {y(l),...y(N)}. 


It  is  obvious  from  Eq.  (8.14)  and  (8.15)  that  if  the  full  state  Wk)}^^  ^  was 
known,  matrices  F,G  and  H  could  be  estimated  using  Regression  or  OLS.  We,  therefore, 
define  the  complete  data  set  as: 

sjf  =  {  Y  N,  XN}  (8.16) 

■L  1  1 

The  complete  —  data  density  function  is 


f(SiN|.|>)  =  f(YN,x”|« 
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=  jj-  f( Y(j) I X(j),<fr)(X(j) I  x(j— 1):4,) 
J=1 


(8.17) 


Using  Eq.  (8.14)  and  (8.15),  we  can  write 


f(y(j)|x(j),<j>)  -  N(Hx(j),R) 


(8.18) 


f(x(j)[x(j-l),(j,)-N(Fx(j-l),  GQG1) 


(8.19) 


N  \ 

We  now  try  to  express  f(S  J<j))  in  the  form  (8.4)  to  identify  t(Sx  )  and  4>. 


f(»= 


'2  7r )  "sti'r 


N  2 

exp  -  1/2  {  S  |  |y(j)-Hx(j)|  |  , 

j=l  R  1 


+  I  |x(j)  —  Fx(j— 1)  1 1  (GQGT)— lj 


(8.20) 


The  term  outside  the  exponent  represents  b/a((j>).  The  term  in  the  exponent  within 
N  T 

brackets  represents  (j)  t(S  )  as  follows: 


*  t(  N)T  =  Tr{R— 1 S  (y(j)  -  Hx(j))  (y(j)  -  Hx(j))T} 


+  Tr  (GQGTr1  S  {(x(j)  -  Fx(j-l))(x(j)  -Fx(j-1))T} 
i=l  / 


(8.21) 


The  next  step  is  to  redefine  the  unknown  parameters  in  such  a  way  that  they  appear 


linearly  in  the  above  equation.  The  terms  multiplying  a  given  parameter,  then,  define  the 

associated  sufficient  statistic  for  the  estimation  of  that  parameter.  As  an  example,  consider 

7 

the  case  in  which  only  the  noise  covariances  R  and  GOG  are  unknown. 

This  is  a  very  common  case  in  Adaptive  Kalman  Filtering  and  correlation  type 

methods  for  the  estimation  of  these  matrices  were  considered  in  Mehra  (1970).  E— M 

algorithm  leads  to  a  new  method  for  the  MLE  of  noise  covariances,  which  should  have 

great  practical  significance.  Based  on  Eq.  (8.21),  it  is  better  to  estimate  R  1  and 
T  —1 

(GQG  )  .  The  sufficient  statistics  are: 

_N  _L 

*17=1  <(y(j)-H*(j)  (y(j)  -h  x(jf }  *or  R 

This  is  intuitively  obvious  since  the  above  quantities  are  sample  covariances  of  v 

/ — ■ 

La 

and  gw. 

t 

Before  proceeding  to  the  more  general  case  of  additional  unknowns  {F,H},  we  derive 
the  E-M  algorithm  for  the  above  case. 

E-Step: 

t(ik>(s(f)  =  E[t1(SN)|yN,^k)] 

=  S  {(y(j)  -  Hx(j  |  N))  (y(j)  -  Hx(j|  N))T 

j=l 

+  H  P  { j |  n}  HT}  (8.22) 


8-8 


*  y  1 , 

where  x(j;  X)  denotes  the  smoothed  estimate  of  x(j)  based  on  Y  ‘  and  ^  *.  The 
associated  covariance  is  denoted  by  P(j|  X). 

The  equations  for  the  computation  of  f  (j|  X)  and  P(j  |n)  are  well  known  in  the 

control  literature.  See  (Bryson  and  Ho  (1975)).  The  computations  can  be  carried  out 

recursively  using  a  Kalman  Filter  and  a  backward  sweep.  Alternatively,  x(j|N)  can  be 

written  as  a  weighted  average  of  a  forward  Kalman  filter  state  estimate  x(j|  Y-*)  and  a 

N 

backward  Kalman  filter  estimate  x.(j |  Y_|_  ),  (see  Mehra  (1968)). 

Similarly  t  is  estimated  as 

N 

t  (k)(SN)  =  S  (x(j|  N)  —  Fx(j — 1 1 N — 1))  (x(j|N)-Fx(j-l|N-l))T+  P(j|N)  + 

2  ‘  j=t 

FP  (j-1 1 N)  PT  -  CN(j,j — 1)  FT  -  F  C  Jj(iH)  (8.23) 

where  CN(j,j— 1)  denotes  the  correlation  between  the  smoothing  errors  at  j  and  j-1. 
An  expression  for  C^(j,j— 1)  can  be  found  from  the  smoothing  equations. 

The  E— step,  therefore,  consists  of  running  two  Kalman  filters  or  a  filter/smoother 
and  solving  Eq.  (8.22)  and  (8.23).  The  calculations  can  be  made  recursive  in  data  length 
N. 

M-Step:  For  this  step,  we  need  to  evaluate  E(t  |<J>)  and  ECt^j (j>),  expressing  them  as 
functions  of  <J>,  equating  them  to  the  values  for  t  ^  and  obtained  in  the  E— step  and 
solving  for  <j).  This  is  quite  straight-forward  based  on  the  state 

N  rp 

E(t  |<j>)=  S  E[v(j)  v 1  (j)j  =  NR  (8.23) 

1  j=i 
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X 

E  (t  ;<?)=£  E  iGw(j)\v 

:  J=1 


T(j)GT]  =  X  GOG1 


Therefore, 


R(k+l)  =  i  £  {(y(j)  —  Hx(j|  N))  (y(j)  —  Hx(j|  X) 
X  j=l 


+  H  P  (j | N)  HT}  (8.25) 

(GOGT/k+l)  =i  S{(x(j|N)-Fx(j-l)N)). 

N 

(x(j |  N)  —  Fx(j— 1 1  N))T  +  P(j|N)  +  F  P  ( j — 1 1 N)  FT 

-  CN  FT  -  F  cj  (j,  j-1)}  (8.26) 

We  now  consider  the  case  of  unknown  {F,R}.  In  this  case,  the  estimation  of  R 
remains  unchanged,  but  the  sufficient  statistics  for  the  estimation  of  F  are 

N  T  N  T 

t-  =  £  x(j)  xi(j-l)  and  t4  =  E  x(j)x  (j). 

6  j=l  j=1 

B(t3l«  =  NFExx 


E  (t4|+)  =  N 


where  E^  is  the  covariance  of  x.(j). 
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Using  the  E— step. 


-(k)  X  .  -T 

where  t  =  S  {xQINJx1  (j— 1 1 N)  +  CN(j,j-l)} 

13  j=l  * 

N 

«ik)  =  s  {x(j|N)xT(j|N)  +  P(j|N)} 

j=l 

T 

The  above  results  are  easily  generalized  to  the  case  of  unknown  {F,H,R,  GQG  }. 
However,  identifiability  issues  must  be  considered.  In  particular,  F  and  H  must  be  put  in  a 
canonical  form.  If  output  canonical  form  is  used,  H  consist  of  zeros  and  ones  and  F  has  no 
more  than  np  parameters  where  n  is  the  state  dimension  and  p  is  the  output  y  dimension. 

8.2.1  Relationship  of  E— M  algorithm  to  direct  MLE: 

Direct  MLE  of  parameters  <j>  is  given  in  Appendix  B  involves  maximization  of 
L(4>)  =  log  p  (Y^l^) 

The  computational  aspects  are  discussed  in  Gupta  and  Mehra  (1974).  It  is  interesting  to 
note  the  similarities  and  differences  between  the  two  approaches.  In  particular,  if  the 
smoothing  form  of  L(<j>)  given  in  Schweppe  (1973)  is  used,  the  similarities  are  quite  striking. 

dx 

The  main  difference  is  that  the  Kalman  Filter/Smoother  sensitivity  equations  involving 
and  d?/d$  do  not  have  to  solved,  resulting  in  significant  computational  savings.  On  the 
other  hand,  the  rate  of  convergence  of  the  Gauss-Newton  iteration  is  probably  faster, 
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though  the  convergence  may  be  more  problematic,  especially  where  some  parameters  are 
unidentiable  resulting  in  a  singular  Fisher  Information  Matrix.  A  comparison  of  the  two 
methods  on  practical  problems  would  be  of  great  interest. 

8.3  E— M  Algorithm  for  ARMA  models: 

ARMA  (p,q)  model  has  the  form 

P  q 

y(t)  =  S  a-  y(t — i)  +  u(t)  +  t  b-  u(t-i)  (8.28) 

i=l  1  i=l  1 

t  =  1,2, . N 


where  {u(t)}  represents  a  random  shock  series  which  is  zero  mean,  Gaussian  and 
white  with  variance  Parameter  vector  <j>  is  a  (p+q+1)  vector  of  unknowns  (a  ,  ..ap, 

bi’"'V  °2*y 

N 

We  define  the  complete  data  set  as  consisting  of  the  sequence  ={u(t),  t=l,  N} 
N  N 

since  given  U  ,  the  observed  data  set  Y  can  be  constructed  exactly  from  Eq.  (8.28).  The 

N  N 

sequence  U  ^  is  the  innovation  sequence  and  is  related  to  Y  ^  through  a  causal  and  causally 

invertible  transfer  function.  It  is  assumed  that  N  is  large  so  that  the  effect  of  initial 

conditons  (i.e.  values  of  y(t)  and  u(t)  for  t<o)  is  negligible. 


P(UX >)=  T  P(u(t)|<J>) 
1  t  =  l 


(8.29) 


where  p(u(t)|<j>) ^N(o,^). 


u(t)  can  be  obtained  from  Eq.  (8.28)  as 
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iS-40) 


u(0  =  y(0  -  -  -  bj  u(t-i) 

i=l  i=l 


Using  this  equation. 

P(U’|«-J _ ,.  exp 

( 2  5T<ru  )N 


-  S  b-u((t — i)] 2 
i=l  1 


N  -1  • 

The  sufficient  statistics  t(Ux  )  for  estimation  of  parameters  \a  ...a  b  ...b  ,  a2}  are 

v  r  F  1  i  P,  i  q’  u; 

N  N  N  N  N 

S  y(t-i)  u(t-j)  S  y2(t),  S  y(t-i)  y(t-j),  S  y(t-i)  u(t-j)  and  £  u2(t).  In  order 

t=l  7  t=l  t=l  t=l  t=l 

to  obtain  conditional  mean  of  t(U^)  given  Y^,  u(t)  is  expressed  in  terms  of  y(t)  sequence 
by  inverting  Eq.  (8.28).  The  inversion  is  done  most  easily  by  using  lag  operator,  z. 


u(t)  =  A  j  2  j  y(t)  =  C(z)  y(t)  (8.31) 

bT^T 


P  ■ 

where  A(z)  =  1—  £  az 
i=l  11 


B(z)  =  1-  £  b-z1 
i=l  1 


C(z)  =  1-  £  C.z1 


C(z)  represents  the  equivalent  AR  model  of  a  high  order,  l. 


The  above  relationships  suggest  the  following  E— M  algorithm: 


i  £7 


u 


_  S  [y(t) 
2  t  =  l 


S  q.v(t-i) 
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E— Step:  Given  Y^  and  $'v!vl.  estimate  sequence  { u 1 1 ) }  from  Eq.  i  S - 3 1  ).  which 
corresponds  to  a  long  aritoregression.  The  order  of  the  AR  mode!  can  be  determined  using 
AIC. 

M— Step:  Update  (j>  parameters  {a^.-.a  b  ...b  ,  a*}  using  regression  of  Y(t)  on 
lagged  values  of  y(t)  and  u(t).  The  parameter  estimation  involves  use  of  the  same 
sufficient  statistics  as  identified  above. 


An  alternative  procedure  to  estimate  parameters  which  is  similar  to  the  CVA 

method  is  to  first  obtain  predictors  y(t+l  1 1),  y(t+2 1 1),.»  y(t+n|  t)  using  orthogonal 

.1 

projections  or  conditional  expectations,  where  n  =  max  (p,g).  Then  from  Eq.  (8.38), 


p  CL;  . 

y(t+n|t)=  S  a-  y(t+n— i  1 1)  (8.32) 

i=l 

a 

Equation  (8.32)  is  used  to  estimate  {d-}^  p. 


Then  {b-}  are  obtained  from 

B(z)  =  A(z)/C(z)  (8.33) 

o 

a ^  is  estimated  from  the  variance  of  the  residuals  from  the  long  autfregression  that 
estimates  coefficients  of  C(z). 

8.4  Relationship  between  CVA  —  Regression  and  E— M  Algorithm: 

The  procedure  used  in  section  8.3  for  ARMA  models  can  be  generalized  to  state 
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space  models  which  are  equivalent  to  multiple  AR.MA  models.  The  role  of  sequence  u^tl  is 
played  by  the  innovation  sequence  i/(t)  in  the  state  space  modeling  framework.  We  use  the 
Kalman  Filter  representation  of  the  state  model. 

x(k+l|k)  =  F[x(k | k — 1)  +  K  v  (k)]  (8.34) 

.  < 

y(k)  =  Hx(j|  k— 1)  +  Kk)  (8-35) 

It  is  well-known  that  the  innovation  sequence  (t/(k)}  and  the  output  sequence 
(y(k)}  are  related  through  a  causal  and  causally  invertible  transfer  function.  Therefore, 
given  Mk)}>  {y(k)}  can  be  obtained  from  Eq.  (8.34)  and  (8.35).  To  obtain  i/(k)  from  y(k), 
we  rewrite  Eqs.  (8.34)  and  (8.35)  as 

i(k+l|k)  =  Fx(k|  k— 1)  +  FK(y(k)-Hx(k|k-l)) 

=  F(I-KH)  x(k|  k— 1)  +  FKy(  k)  (8.36) 

Kk)  =  y(k)-Hx(k|k-1)  (8.37) 

Defining  the  complete  data  set  as  SN  =  {Kk)l  w>  the  E-M  can  be 

cu 

implemented  in  the  same  way  as  for  the  ARMA  case.  We  are  assuming  here  tht  N  is  large 
and  the  system  is  stable  so  that  the  effect  of  initial  state  x(o)  is  negligible. 

E— Step:  Given  and  parameter  values  in  {F,H,K},  use  Eq.  (8.36)  and  (8.37)  to 
estimate  the  sequence  iF.  At  the  same  time,  the  sequence  {x(k  |  k — 1 ) }  is  estimated  and 
the  sufficient  statistics  are  computed. 
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M— Step:  Update  parameters  in  {F.H.K}  by  equating  sufficient  statistics  from  the 
i  E-step  to  their  expected  values,  which  are  only  a  function  of  §.  Alternatively,  perform 

regressions  using  Eq.  (8.34)  and  (8.35)  and  treating  //(k)  as  white  noise. 

Notice  that  the  regression  step  of  the  CVA  algorithm  of  the  previous  chapters  is 
similar  to  the  above  M— step.  The  major  difference  lies  in  the  E— step,  where  the  CVA 
algorithm  estimates  x(k|k— 1)  nonparametically  assuming  no  a  priori  model  structure.  In 
practice,  this  is  a  necessary  first  step  since  it  provides  model  structure  and  initial 
parameter  estimates. 

The  alternative  procedure  described  in  section  (8.3)  for  implementing  M— step  has 
been  generalized  by  Akaike  (1976)  and  Mehra  (1982)  to  the  current  situation.  The 
covariance  of  i/(k)  is  obtained  by  fitting  a  long  auto-regressive  model.  The  F-parameters 
are  obtained  from  the  CVA  calculation  after  model  order  has  been  determined.  K  matrix 
is  obtained  by  equating  the  transfer  functions  from  the  AR  model  and  the  state  space  eqn. 
model  (8.34)  and  8.35). 

The  above  relationship  between  CVA  and  E— M  algorithm  shows  that  CVA  only 
implements  one  step  of  the  E— M  algorithm.  By  repeating  these  steps,  as  outlined  above, 
one  can  increase  the  likelihood  function  and  achieve  convergence  to  MLE.  Furthermore, 
the  recognition  of  the  above  relationships  shows  us  how  to  make  CVA  recursive. 

It  should  be  remarked  that  the  results  of  the  sections  8.2,  8.3  and  8.4  are  extended 
easily  to  the  case  of  exogenous  inputs  or  forcing  functions  which  are  known. 

8.5  Recursive  ML  Identification: 
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8.5.1  ARM  A  Case: 


In  this  section,  we  show  how  the  E  and  M  steps  can  be  implemented  recursively  for 

adaptive  time  series  analysis.  For  simplicity,  consider  first  the  ARMA  model  of  Eq.  (8.2S). 

The  recursive  estimation  of  AR  models  is  well-known  (Ljung  and  Soderstrom  (1983)).  The 

computations  of  {u(t)}  from  (y(t)}  can  be  made  recursive  both  in  model  order  and  data 

length  for  the  univariate  as  well  as  the  multivariate  cases.  In  order  to  perform  recursively 

M— step  involving  regression,  we  create  a  new  state  vector  <j)(t)  consisting  of  {a^.a^, 

b  ,...b  }.  For  constant  coefficient  ARMA  models, 
i  Q 

<Kt+l)  =  <j>(t)  (8-38) 

y(t)  =  H(t)  <j)(t)  +  u(t)  (8.39) 

where  H(t)  =  [y(t-l),  y(t-  ),  u(t-l),  u(t-q)] 

On-line  estimation  of  (j>(t)  can  be  performed  recursively  using  a  Kalman  Filter.  The 
variance  parameter  a ^  can  also  be  updated  recursively.  We  can  generalize  to  the  time 
varying  parameter  case  in  which 

<J>(t+l)  =  A  <J>  (t)  +  w(t)  (8.40) 

If  A  and  Cov(w)  are  known,  a  Kalman  Filter  still  provides  the  best  estimates  of 
(j)(t),  along  with  the  covariance  of  the  estimates.  For  A  and  Cov(w)  unknown^, 

E— M  algorithm  needs  generalization.  This  can  be  done  by  defining  A  and  Cov(w)  as  the 
hyper— parameters  and  redefining  the  complete  data  vector  to  consist  of  {u(  ),  <{>(t)} 
sequence. 
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The  recursive  MLE  computations  involve  the  following  steps,  when  a  new  data 
point  y(X+l)  is  received. 

1.  Update  AR  model  parameters  (coefficients  of  C(z))  by  using  y(X+l). 

2.  Recompute  {u(t)},  t=l,  N+l  using  the  new  AR  model. 

3.  Solve  the  KF  equations  for  the  parameter  state  vector  <j>(t),  using  the  new  values 
of  {u(t)}  and  {y(t)},  t=l,  N+l. 

4.  Solve  for  AR  model  parameters  from  the  estimated  <f>  parameters  and  repeat  the 
above  steps. 

In  the  above  procedure,  steps  2  and  3  are  repeated  over  the  whole  data  set.  In  cases 
where  the  parameter  changes  are  small,  it  may  be  sufficient  to  simply  compute  u(N+l)  and 
run  the  KF  only  for  one  step,  without  any  further  iterations.  This  procedure  is 
recommended  in  any  case  to  check  if  the  changes  in  the  parameters  are  large  enough  to 
warrant  recomputation  of  the  whole  (u(t)}  sequence  and  iteration. 

For  the  time  varying  case,  it  is  also  possible  to  implement  a  fixed  data  window  KF 
(see  Mahmood  (1989)).  The  optimal  window  length  can  be  determined  using  a  generalized 
AIC  criterion. 

8.5.2  State  Space  Models: 

The  procedure  for  recursive  estimation  in  state  space  models  is  similar  to  the  one  used 
above  for  ARM  A  models.  For  the  case  in  which  the  model  structure  is  unknown,  CVA  is 
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used  first  to  determine  model  order  and  a  system  state  vector.  Then  matrices  F.H.K  and 
are  determined  using  methods  proposed  by  Akaike  (1974)  and  Mehra  (1982).  In  this 
approach  F  an_H  are  chosen  in  a  canonical  form  with  a  parsimonious  representation.  This 
initial  step  is  nonrecursive  and  requires  a  batch  of  data  {y(l),..y(  ))}.  After  this  step,  the 
algorithm  can  be  made  recursive.  It  is  desirable  to  supplement  CYA  procedure  with  an 
E— M  iteration  on  the  initial  batch  of  data  to  get  MLE.  Then  the  purpose  of  the  recursive 
algorithm  will  be  to  update  parameters  with  a  new  data  point  (N+l)  without  repeating  all 

i 

the  previous  computations. 

\ 

The  recursion  consists  of  the  following  steps: 


1.  Solve  Eq.  (8.36)  and  (8.37)  using  parameters  from  CVA  and  store 
{x(k|k— l),Kk)}k=1)N 


y  _  [7  S'.".  , 

2.  Rewrite  Eq.  (8.34)  and  (8.35)  as  regression  equations  with  (f>  as  dependent 

.  cvs  cb(.r4^x/''-‘-vv '  rS. — £-0  ,  /  ~ 

variable  {x(k+l  |  k),  y(k)}  and  independent  variables.x(k | k— 1).  Define  a  state  equation  for 
(j)  as  (8.48)  and  use  a  Kalman  Filter  to  estimate  (j)  using  the  regression  equations  as  the 
measurement  equations  for  <j>.  We  omit  the  details  since  the  procedure  is  similar  to  the  one 
used  for  the  ARMA  modeHn  writing  Eq.  (8.39)rami'Tfhe ^alman  Filter  equations  are 
well-known. 


3.  Repeat  the  above  steps  till  convergence  is  achieved. 

Depending  on  the  change  in  the  parameter  estimates  for  applying  the  steps  1  and  2 
to^N+1),  a  decision  can  be  made  to  implement  steps  1  and  2  on  the  entire  data  set  or  a 
suitable  window.  Similarly  step  3  should  be  used  if  the  changes  in  parameters  is  large. 
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The  major  beneGt  of  using  the  above  procedure  aw  repeated  use  of  CYA  is  that  the 
time  consuming  CYA  and  complete  regression  calculations  do  not  have  to  be  repeated  at 
each  step. 

8.6  Other  Extensions: 

8.6.1  Missing  Data: 

The  E-M  algorithm  is  ideally  suited  for  handling  missing  data  points.  The  complete  data 

o 

set  is  simply  expanded  by  the  missing  y(.)  data  pints.  If  an  initial  model  is  available,  and 
the  E-step  is  carried  out  using  a  KF,  the  missing  data  points  presents  no  problem  as  such 
since  the  KF  can  handle  data  points  taken  at  arbitrary  times.  In  fact,  the  KF  produces 
state  estimates  at  every  step  of  the  state  transition  equation.  The  missing  outputs  are  then 
estimated  from  the  corresponding  state  estimates  and  used  in  computing  the  sufficient 
statistics.  The  M-step  is  based  on  the  use  of  the  estimated  complete-data  sufficient 
statistics  in  a  regression  or  other  type  of  parameter  estimation  process.  As  the  E— M  steps 
are  repeated,  the  missing  data  points  are  replaced  by  their  best  estimates  based  on  the 
observed  data  and  the  model  parameters.  The  same  procedure  can  also  be  used  to  handle 
outliers  or  bad  data  points. 

8.6.2  Nongaussin  Statistics: 

E-M  algorithm,  like  MLE,  is  applicable  to  the  nongaussin  case,  as  long  as  the  form  of  the 
distribution  function  or  the  likelihood  function  are  known  except  for  the  unknown 
parameters.  However,  for  the  nonexponential  family  of  distributions,  the  M— step  involves 
maximization  of  a  function.  The  details  of  the  M-step  and  its  properties  are  well  covered 
in  Dempster  etal.  (1977). 
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CHAPTER  9 


SUMMARY,  CONCLUSION  AND  FUTURE  RECOMMENDATIONS 


9.1  INTRODUCTION: 


The  main  goal  of  this  project  was  to  investigate 
theoretical  issues  about  Adaptive  Time  Series  Analysis 
including  a  system  identification  technique  known  as  CVA- 
AIC .  A  predictive  inference  and  entropy  framework  was 
selected  for  the  analysis.  The  CVA-AIC  technique  was  also 
be  compared  with  other  techniques  of  system  identification. 

This  technique,  developed  over  the  last  decade,  has 
been  found  successful  in  many  practical  applications.  Such 
success  can  be  attributed  to  the  fact  that  CVA-AIC  has 
better  statistical  properties  than  many  other  existing 
techniques  and  allows  automatic  optimal  model  order 
selection  using  concepts  of  entropy  and  predictive 
inference.  The  latter  capability  relieves  a  user  from  the 
burden  of  applying  subjective  decisions  regarding  model 
order  selection.  The  CVA-AIC  technique  is  based  upon  the 
stochastic  realization  theory,  statistical  theory  of 
canonical  correlation  analysis  and  the  order  selection 
procedure  based  upon  Akaike's  Information  Criterion  (AIC) . 
Although  these  underlying  theories  are  based  upon  rigorous 
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mathematical  justification,  various  specific  steps  of  the 
CVA-AIC  technique  have  not  been  founded  upon  strict 
mathematical  rigor.  Instead,  the  technique  has  been  used  on 
an  adhoc  basis  in  practical  applications.  For  this  reason, 
the  effort  of  this  project  was  devoted  to  the  strengthening 
of  theoretical  aspects  of  the  CVA-AIC  technique  and 
exploring  its  relationship  to  other  techniques,  instead  of 
producing  empirical  results  using  extensive  simulation  runs. 
Before  presenting  the  conclusions  of  this  effort,  we  provide 
a  summary  of  the  report. 

9.2  SUMMARY: 

The  report  start  with  an  overview  of  the  adaptive  time 
series  problems.  The  necessary  mathematical  preliminaries 
have  been  presented  in  Chapter  2 .  It  has  been  shown  at  the 
beginning  of  this  chapter  (Section  2.3),  that  if  the 
dimension  of  the  parameter  vector  is  known,  then  the  AIC 
criterion  is  asymptotically  equivalent  to  the  maximum 
likelihood  estimation  problem.  In  the  second  half  of  this 
chapter,  the  estimation  technique  for  model  entropy  has  been 
presented.  These  results  are  not  new,  but  have  been 
presented  here  for  the  sake  of  completeness  of  the  report. 
Moreover,  the  analysis  has  been  carried  out  in  a  discrete 
valued  random  variable  framework  which  is  new  and  more 
transparent.  The  CVA  theory,  in  its  most  general  form,  has 
been  presented  in  Chapter  3.  Here  the  case  of  modelling 
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stationary  processes  with  possibly  singular  covariance 
matrices  has  been  considered.  It  has  been  shown  that  this 
type  of  problem  can  be  solved  using  a  generalized  singular 
value  decomposition  process  leading  to  a  generalized 
canonical  variate  theory.  The  analysis  in  this  chapter 
includes  the  standard  CVA  theory  when  the  respective 
covariance  matrices  are  of  full  rank.  Several  schemes  for 
generating  state  space  models  have  been  presented  in  Chapter 
4.  In  Section  4.1,  the  technique  for  computing  the  Kalman 
Filter  form  of  the  state  space  model  has  been  developed 
where  it  has  also  been  shown  how  to  transform  this  form  to 
the  standard  state  space  model.  The  technique  for  computing 
the  latter  model  has  been  presented  in  Section  4 . 2  and 
finally,  a  technique  that  is  recursive  in  model  order  has 
been  developed  for  the  standard  state  model  in  Section  4.3. 
The  motivation  behind  this  effort  is  that,  on  many 
occasions,  models  of  increasing  order  are  computed  until  a 
desired  characteristic  (such  as  a  desired  mode)  is  detected 
in  the  model.  The  problem  of  confidence  interval  estimation 
around  the  power  spectrum  and  systems  transfer  function  is 
dealt  with  in  Chapter  5.  The  motivation  for  undertaking 
this  analysis  is  that  both  the  power  spectra  as  well  as 
confidence  bands  are  needed  in  many  design  problems.  For 
example,  in  control  system  design,  a  control  law  is  designed 
for  a  nominal  plant  to  obtain  a  prespecified  performance 
level  and  stability  margin  that  will  also  domain  valid  for 
the  entire  set  of  plants  in  the  neighborhood  of  the  nominal 
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plant  in  the  frequency  remain.  Therefore,  the  uncertainty 
region  around  the  identified  model  must  lie  within  the 
region  allowed  by  the  control  law. '  it  is  interesting  to 
note  that  the  size  of  the  confidence  interval  around  the 
identified  model  can  be  adjusted  from  the  data  length. 

The  abrupt  change  in  a  model  is  encountered  frequently 
in  a  real  world  situation.  For  example,  sensor  failure  or  a 
component  failure  in  a  system  may  induce  an  abrupt  change  in 
the  system  model.  This  change  must  be  detected  quickly  and 
corrective  action  must  be  taken  to  prevent  any  catastrophic 
failure.  It  has  been  demonstrated  in  Chapter  6  that  CVA-AIC 
technique  can  be  used  for  such  fault  detection  by  comparing 
the  value  of  AIC  on  each  successive  data  interval.  The 
technique  developed  in  this  chapter  has  been  demonstrated  on 
a  simulation  example.  As  shown  in  Chapter  7,  an  entirely 
different  approach  is  taken  for  a  slowly  varying  system.  In 
this  case,  the  data  is  divided  into  various  subintervals  and 
a  separate  model  that  is  optimal  in  the  sense  of  AIC 
criterion  is  identified  for  each  segment.  In  the  last 
section  of  this  chapter,  the  technique  has  been  illustrated 
through  a  simulation  example. 

Chapter  8  is  devoted  to  the  application  of  a  technique 
known  as  the  E-M  algorithm  for  extension  of  the  previous 
results.  Although  the  technique  is  relatively  new  to  the 
engineering  community,  the  researchers  in  the  field  of 
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statistics  have  used  this  technique  from  the  mid  seventies. 
This  is  a  very  powerful  and  general  framework  in  which  many 
classes  of  estimation  problems  can'be  formulated  and  solved. 
In  addition  theoretical  issues  such  as  convergence  rates  can 
be  analyzed.  It  has  been  shown  in  this  chapter  how  the  CVA- 
AIC  technique  fits  naturally  into  the  E-M  framework  and  how 
the  maximum  likelihood  estimates  (MLE)  of  the  parameters  can 
be  obtained  starting  from  the  CVA-AIC  estimatse.  In 
addition,  extension  to  time  varying  parameters  and  recursive 
parameter  estimation  schems  are  described.  The  E-M 
algorithm  presents  a  new  way  of  analyzing  and  extending  the 
CVA  technique.  At  the  same  time,  it  unifies  different 
techniques  of  adaptive  time  series  analysis  and  shows 
clearly  the  relationship  between  them. 


9.3  CONCLUSIONS: 

The  investigations  under  the  scope  of  this  project  have 
enhanced  our  understanding  of  how  to  analyze  a  time  varying 
system  in  a  predictive  inference  and  entropy  framework.  Our 
attention  was  focused  mainly  on  the  CVA-AIC  technique  and 
its  relationship  to  other  techniques  for  system 
identification.  The  major  conclusions  of  this  project  are: 


9-5 


(i)  The  problem  posed  by  Akaike  (for  computing  AIC) 
is  asymptotically  equivalent  to  maximum  likelihood 
estimation  problem  when  the  parameter  dimension  is  known. 

(ii)  The  CVA-AIC  technique  can  be  extended  using  the 
E-M  Algorithm  in  such  a  way  that  the  model  will  converge 
monotonically  towards  the  ML  estimates. 

(iii)  The  problem  of  entropy  maximization  can  be  posed 
either  for  the  continuous  time  or  discrete  time  stochastic 
processes . 

(iv)  The  CVA  theory  has  been  extended  to  include  more 
general  type  of  time  series.  The  extended  theory  is  known 
as  "Generalized  CVA  Theory"  and  can  handle  time  series  with 
singular  covariance  matrices. 

(v)  In  the  CVA-AIC  framework,  the  model  can  be 
identified  either  in  the  standard  state  space  form  or  in  the 
Kalman  Filter  form,  depending  upon  the  type  of  application 
at  hand. 

(vi)  The  standard  state-space  model  of  various  orders 
can  be  computed  recursively  starting  from  order  one. 

(vii)  Once  a  state-space  model  is  identified,  the 
system  transfer  function,  noise  power  spectrum  and 
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associated  bands  cf  various  confidence  levels  can  be 
computed . 

(viii)  An  abrupt  change  in  a  model  can  be  detected  in  a 
CVA-AIC  framework  by  partitioning  the  data  into  various 
segments  and  comparing  the  value  of  the  optimal  AIC  from 
various  segments.  This  technique  can  be  used  as  a 
generalized  fault-detection  scheme. 

(ix)  E-M  algorithmic  approach  provides  a  very  general 
framework  where  most  of  the  estimation  problem  can  be 
formulated.  If  the  CVA-AIC  technique  is  embedded  properly 
in  an  E-M  framework,  it  would  leads  to  maximum  likelihood 
estimates. 

9.4  FUTURE  DIRECTION: 


Considerable  theoretical  analysis  has  been  done  in  this 
project,  yet  more  is  needed  to  understand  the  CVA-AIC  and 
the  E-M  techniques  fully.  Also  future  efforts  should  be 
directed  towards  practical  implementation  issues.  It  is 
recommended  that  the  future  work  in  this  area  should 
include,  but  not  necessarily  be  limited  to,  the  following 
items : 

(a)  It  has  been  reported  that  the  CVA-AIC  technique 
produces  estimates  that  are  close  to  MLE  estimates.  In 
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fact,  it  can  be  shown  for  independent  and  identically 
distributed  random  variables,  that  the  estimate  generated  by 
the  AIC  criterion  is  asymptotically  equal  to  the  MLE 
estimate.  More  theoretical  investigation  is  needed  to 
assess  the  performance  of  the  generalized  CVA-AIC  technique 
relative  to  the  MLE  technique  in  the  general  Adaptive  Time 
Series  setting. 

(b)  Currently,  CVA-AIC  technique  has  been  implemented 
in  a  batch  mode  and  is  computationally  demanding.  The 
technique  is  suitable  for  a  slowly  time  varying  system  where 
the  model  may  need  periodic  updating.  For  systems  with  fast 
parameter  changes,  a  recursive  form  of  CVA-AIC  technique 
needs  to  be  developed  so  that  the  it  can  be  used  in  real 
time.  We  have  developed  approaches  in  this  direction  using 
the  E-M  Algorithmic  approach. 

(c)  It  has  been  theoretically  demonstrated  here  how  to 
compute  the  bounds  at  various  confidence  levels  on  the 
systems  transfer  function  and  the  noise  power  spectral 
density.  These  algorithms  need  to  be  implemented  and 
verified  via  Monte  Carlo  simulations. 

(d)  The  state  space  matrices  obtained  from  the 
existing  CVA-AIC  algorithm  are  not  in  any  particular 
canonical  form  and  may  be  over-parametrized.  The  effect  of 
this  over-parametrization  on  the  efficiency  of  the  estimates 


finding  the  model  -  it  should  be  investigated  whether  this 
can  be  replaced  by  other  algorithms  with  lesser 
computational  burden. 

(i)  The  combination  of  CVA  and  E-M  Algorithm  based 
approaches  derived  in  this  report  for  time  varying  and 
constant  systems  should  be  tested  on  practical  examples  of 
increasing  complexity.  These  techniques  are  extremely 
promising  for  solving  the  general  Multivariate  Adaptive 
Times  Series  Identification  problem 
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APPENDIX 

MAXIMUM  LIKELIHOOD  ESTIMATION 

We  present  here  some  basic  background  on  maximum  likelihood  estimation, 
which  is  used  throughout  this  report. 

The  likelihood  function  for  a  sample  xj,  X2  >  •  •  •  xn  parametrized  by  a 
parameter  0  is 

L  =  tt  p(Xi  I  0)  (A.l) 

i=l 

Assume  the  x^  are  drawn  independently  from  the  true  distribution 
p(x  |  0q)-  Then  L  is  the  joint  distribution  function  of  x^ ,  X2 ,  .  .  .  ,  xn 
and 
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Differentiating  wrt  0: 
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Differentiating  again 
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where  0  is  the  maximum  likelihood  estimate  which  satisfies 
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and  0q  is  the  true  parameter  value. 
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Now  define  Che  covariance  matrix 
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and  factor  C  as  C  =  W  W 
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The  right  hand  side  is  approximated  as 
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The  left  hand  side  is  a  normalized  gaussian  variate  since 
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Thus,  the  right  hand  side  is  also  a  normalized  gaussian  variate  and 
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C  is  the  Fisher  information  matrix,  which  is  the  inverse  of  Che  covariance 
matrix  of  the  parameter  estimation  errors. 
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APPENDIX  B:  An  Innovations  Approach  to  Maximum  Likelihood  Identification  of  Linear 
and  XonLinear  Dynamic  Systems 

This  appendix  presents  an  approach  to  maximum  likelihood  identification  of  multi-input 
multi— output  linear  and  nonlinear  dynamic  systems  with  arbitrary  inputs.  The  approach 
is  based  on  state  vector  formulation  and  uses  the  innovation  properties  of  optimal  filters  for 
these  systems.  Application  to  the  identification  of  the  transfer  function  of  a  chemical 
reactor  is  considered. 

1.  Introduction 

The  maximum  likelihood  estimation  of  autoregressive  and  moving  average 
parameters  in  time  series  analysis  has  been  considered  by  several  investigators  [1,2].*  The 
related  problem  of  linear  system  identification  can  often  be  cast  in  this  framework,  though 
the  parameter  transformations  invloved  may  be  nonlinear  and  nonunique.  Special 
difficulties  are  encou^te^d  in  handling  multi-input  multi-output  linear  models  and 
nonlinear  models  using  the  time-series  approach.  The  author  [3,4]  has  tried  to  circumvent 
these  difficulties  by  working  directly  with  the  physical  models  and  using  the  innovations 
approach  of  Kailath  [5,6].  A  schematic  diagram  of  this  method  is  shown  in  Figure  B.l. 


measurement  soise 


Figure  B.l  Implementation  of  maximum  likelihood  estimator 
*References  for  Appendix  B  are  given  separately  at  the  end. 
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2.  Linear  Svstems 


Consider  a  discrete— time  linear  system* 

(B.l)  x(t  r  1)  =  Fx(t)  -f  Gu(t)  -I-  Fw(t) 

(B.2)  y(0  =  Hx(t)  4-  v(t) 

where 

x(t)  =  n  x  1  state  vector;  u(t)  =  p  x  1  input  vector; 

w(t)  =  q  x  1  vector  of  random  forcing  functions; 

y(t)  =  r  x  1  output  vectors;  and  v(t)  =  r  x  1  vector  of  output  errors 

and 

E{w(t)}  =  o,  E{w(t)wT(r)}  =  Q  S^T 

t 

where  5,  is  the  Kronecker  delta  function. 

E{w(t)vT(t)}  =  0 

E{v(t)}  =  0,  E{v(t)vT(r)}  =  Rit_r 

★References  for  Appendix  B  are  given  separately  at  the  end. 
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It  is  assumed  that  the  structure  of  the  model  is  known.  The  vector  of  unknown  parameters 
from  F,G,r,H,Q  and  R  isdenoted  by  0.  It  is  assumed  tht  0  is  identifiable. 

The  ML  estimate  of  0  is  given  by 

(B.3)  0  =  Arg  {max  log  p(Y N/0)} 

0 


where 


yn  =  (y(i) . ,y(N)} 


and 


p(Yj y/0)  =  conditional  probaability  density  of  given  0. 
An  expression  for  p(Y -^/O)  is  derived  as 

p(YN/«)  =  p(y(i). . ,y(N)/0) 

=  p(y(N)|YN_1.«)p(YN_1|0) 

=  P(y(N)|YN_1,«)p(y(N  -  1)|Yn_2,0)p(Yn_2|0) 


=  S  p(y(j)|Y  ,0). 
j=i  j-i 
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Therefore 


N 

(B.4)  log  p(YN|  0)  =  S  log  p(y(j)  |  Y  , 0 ) 

1  j=l  j-1 


Consider  the  case  in  which  x(0),  w(t)  and  v(t)  are  normally  distributed.  Then 
p(y(j)  |  Yj  p 0 )  by  a  well-known  property  of  normal  distributions  is  also  normal. 


Let 


(B.5)  E{y(j)|  Y  ,0}  =  y(j|j  -  1) 

j-1 

and 

(B.6)  Cov{y(j)|Yj_1,^}  =  B(j|j-l). 

It  is  known  that  y(j|  j-1)  and  B(j|  j-1)  can  be  obtained  from  a  Kalman  filter  [7]  of  the 
following  form: 

(B.7)  x(t  +  1/t  )  =  Fx(t/t)  +  Gu(t) 

(B.8)  x(t/t)  =  x(t/t  —  1)  +  K(t)Kt) 

(b.9)  Kt)  =  y(t)  -Hx(t/t  -  1) 

(B.10)  K(t)  =  P(t/t  -  l)HTB-1(t/t  -  1) 
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(B  11) 


B(t)  =  HP(t/t  -  1)HT  +  R 


(B.  12) 

P(t/t  =  (I  —  K(t)H)P(t/t  -  1) 

(B.  13) 

p(t  +  i/t)  =  FP(t/t)FT  +  rQrT 

The  likelihood  function  (B.4)  can  now  abe  written  as 

N 

(B.  14)  log  p(YN |  (?)  =  -!  E  K(j)B-1(j/j  -  iMj)  +  log|B(j/j  —  1)|], 

2  . 
j=l 

Here  v(t)  denotes  the  innovation  sequence  which  is  zero  mean,  Gaussian  and  white  [5].  ML 
estimate  0is  obtained  by  maaximiing  (B.14)  with  respect  to  0  subject  to  the  constraints 
(B.7) — (B.  13).  This  is  a  very  difficult  optimization  problem.  An  approximation  suggested 
in  Ref.  (3)  simplifies  the  problem  tremendously.  It  is  assumed  that  the  filter  gain  K(t)  and 
covariance  B(t/t  —  1)  have  reached  constant  values  K  and  B  and  the  vector  0  consists  of 
unknown  parameters  from  F,G,K  and  B  only.  Then 

N 

(B.  15)  logp(YN|«  =  -i  E  K(j)B_1*(j)  +  log|B|). 

2  ; _ i 


Maximizing  (B.15)  over  B,  produces 

-  i  N  *  T 

(B.  16)  B  =  _  S  Kj|«)vi(j|tt) 

N  j=l 

where  a  is  the  ML  estimate  of  unknowns  in  F,  G  and  A.  It  is  given  by  the  root  of  the 
equation 


B-5 


N  T  ,  d  v  (  j  ) 

(B.  17)  E  /'(j)B  1 _ =0 

j=l  da 

where  ( dv(]))/da  is  calculated  from  equations  (B.7)— (B.9).  The  root  of  equation(B.17)  is 
found  by  a  Newton— Raphson  or  Gauss— Newton  iteration.  Once  a  is  obtained,  r,Q  and  R 
are  obtained  from  equations  (10)— (13).  In  this  way,  the  non-linear  constraints  of 
equations  (10)— (13)  are  avoided  during  optimization.  The  above  method  is  no  more 
complicated  than  the  well-known  output  error  method.  In  fact,  it  reduces  to  the  output 
error  method  when  there  is  no  process  noise,  i.e.,  w(t)  =  0.  In  that  case,  Q  =  0,  K  =  0  and 
"(t)  =  y(t)  —  Hx(t)  is  the  output  error.  A  flow  chart  of  the  method  is  shown  in  Figure  B.2. 

3.  Nonlinear  Systems 

Consider  a  nonlinear  dynamic  system 

(B.18)  x(t  +  1)  =  f(x(t),  0,u(t))  +  Tw(t) 

(B.19)  y(t)  =  h(x(t))  +  v(t) 

where  f(.)  and  h(.)  are  n  x  1  and  rxl  vectors  of  nonlinear  functions.  Also,  w(t)  and  v(t) 
are  Gaussian  white  noise  sequences  with  zero  mean  and  covariances  Q  and  R. 

The  evaluation  of  the  trule  ML  estimate  would  require  the  calculation  of 
p(y(j)  |  Yj _p0)  using  an  optimal  nonlinear  filter.  Since  this  is  computaationally  infeasible, 
we  approximate  p(y(j)|Yj_p  0)  by  a  Gaussian  density  with  mean  and  covariance  obtained 
from  an  Extended  Kalman  Filter  [8]  of  the  following  form: 
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